
Modeling Prompt Adherence in Student Essays
Isaac Persing and Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas

Automated Essay Scoring
• Important educational application of NLP
• Related research on essay scoring:
  – Grammatical errors
  – Coherence
  – Thesis clarity
  – Organization
  – Prompt adherence

What is Prompt Adherence?
• Refers to how related an essay's content is to the prompt for which it was written
  – An essay with a high prompt adherence score consistently remains on the topic introduced by the prompt and is free of irrelevant digressions

Goal
• Develop a model for scoring the prompt adherence of student essays

Related Work on Prompt Adherence Scoring
• Off-topic sentence detection (Higgins et al., 2004)
• Off-topic essay detection (Higgins et al., 2006)
• Off-topic essay detection with short prompts (Louis and Higgins, 2010)

These approaches are:
• Binary decisions (off-topic/on-topic)
• Knowledge-lean
  – Features derived from semantic similarity measures
    • Random indexing (RI) and Content Vector Analysis (CVA)

What are the differences between our work and previous work?
• Task: supervised prompt adherence scoring
  – Score can range from 1 to 4 points
• Approach: feature-rich

Plan for the Talk
• Corpus and Annotations
• Approach for scoring prompt adherence
• Evaluation

Selecting a Corpus
• International Corpus of Learner English (ICLE)
  – 4.5 million words in more than 6,000 essays
  – Written by university undergraduates who are learners of English as a foreign language
  – Mostly (91%) argumentative writing topics
• Essays selected for annotation:
  – 830 argumentative essays from 13 prompts
  – Annotate each essay with its prompt adherence score

Prompt Adherence Scoring Rubric
• 4 – essay fully addresses the prompt and consistently stays on topic
• 3 – essay mostly addresses the prompt or occasionally wanders off topic
• 2 – essay does not fully address the prompt or consistently wanders off topic
• 1 – essay does not address the prompt at all or is completely off topic
• Half-point increments (i.e., 1.5, 2.5, 3.5) allowed

Inter-Annotator Agreement
• 707 of the 830 essays scored by both annotators
• Perfect agreement on 38% of essays
• Scores within 0.5 points on 66% of essays
• Scores within 1.0 point on 89% of essays
• Whenever the annotators disagree, use the average score, rounded to the nearest half point (a small sketch follows)
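A minimal sketch of this merging rule; rounding ties (averages ending in .25 or .75) upward is our assumption, since the slide does not specify a tie-breaking direction:

```python
import math

def merge_scores(a: float, b: float) -> float:
    # Average the two annotators' scores, then snap to the nearest
    # half point.  Rounding ties upward is our assumption.
    return math.floor((a + b) / 2.0 * 2.0 + 0.5) / 2.0

print(merge_scores(3.0, 4.0))  # 3.5
print(merge_scores(2.0, 2.5))  # average 2.25 -> 2.5 under round-half-up
```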

Plan for the Talk
• Corpus and Annotations ✓
• Approach for scoring prompt adherence
• Evaluation

Approach for Scoring Prompt Adherence
• Recast the problem as a linear regression task
• Train one regressor per prompt
  – Common problems students have writing essays for one prompt may not apply to essays written for another
• One training instance per training essay
  – "Class" value: prompt adherence score
  – Learner: LIBSVM
  – Features: 7 feature types
(A training sketch follows.)
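A minimal sketch of the per-prompt regressor setup, assuming essays arrive as (prompt id, feature dict, gold score) triples. scikit-learn's SVR, which wraps LIBSVM, stands in for direct LIBSVM calls, and the linear kernel is our assumption:

```python
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVR  # scikit-learn's SVR wraps LIBSVM

def train_per_prompt(essays):
    """essays: iterable of (prompt_id, feature_dict, adherence_score).
    Returns one fitted (vectorizer, regressor) pair per prompt."""
    by_prompt = defaultdict(list)
    for prompt_id, feats, score in essays:
        by_prompt[prompt_id].append((feats, score))
    models = {}
    for prompt_id, data in by_prompt.items():
        vec = DictVectorizer()
        X = vec.fit_transform(f for f, _ in data)   # one instance per essay
        y = [s for _, s in data]                    # "class" value = score
        models[prompt_id] = (vec, SVR(kernel="linear").fit(X, y))
    return models
```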

Features
• 7 types of features:
  – baseline features
  – 6 types of new features

Baseline Features
• Features based on Random Indexing (RI)
  – Adapted from Higgins et al. (2004)
• Random indexing:
  – "An efficient and scalable alternative to LSI" (Sahlgren, 2005)
  – Generates a semantic similarity measure between any two words
  – Generalized to computing similarity between two groups of words (Higgins & Burstein, 2007)

Why Random Indexing (RI)?
• May help find text in an essay related to the prompt even if some of its words have been rephrased
  – E.g., the essay talks about "jail" while the prompt has "prison"
• Train an RI model on the English Gigaword corpus (a sketch follows)
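A toy sketch of random indexing, assuming the classic formulation: each word gets a sparse random index vector, a word's context vector accumulates the index vectors of its window neighbours, and two word groups are compared via the cosine of their summed vectors (in the style of Higgins & Burstein, 2007). The dimensionality, sparsity, and window size here are illustrative, not the talk's settings:

```python
import numpy as np

DIM, NONZERO, WINDOW = 2000, 8, 2
rng = np.random.default_rng(0)

def index_vector():
    # Sparse ternary index vector: a few random +/-1 entries.
    v = np.zeros(DIM)
    pos = rng.choice(DIM, NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], NONZERO)
    return v

def train_ri(sentences):
    # Context vector of w = sum of index vectors of its window neighbours.
    index, context = {}, {}
    for sent in sentences:
        for w in sent:
            index.setdefault(w, index_vector())
        for i, w in enumerate(sent):
            ctx = context.setdefault(w, np.zeros(DIM))
            for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
                if j != i:
                    ctx += index[sent[j]]
    return context

def group_vec(words, context):
    v = np.zeros(DIM)
    for w in words:
        v += context.get(w, 0.0)  # unseen words contribute nothing
    return v

def group_similarity(words_a, words_b, context):
    # Group-of-words similarity: cosine between summed context vectors.
    a, b = group_vec(words_a, context), group_vec(words_b, context)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0
```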

5 Random Indexing Features
• The entire essay's similarity to the prompt
• The highest similarity of an individual essay sentence to the prompt
• The highest entire-essay similarity to one of the prompt sentences
• The highest individual-sentence similarity to an individual prompt sentence
• The essay's similarity to a manually rewritten version of the prompt that excludes extraneous material
(These are computed below with the RI sketch above.)
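The five features can be computed directly with the group_similarity helper from the previous sketch; sentence tokenization and the manually rewritten prompt are assumed to be given:

```python
def ri_features(essay_sents, prompt_sents, rewritten_prompt, context):
    """essay_sents / prompt_sents: lists of token lists;
    rewritten_prompt: token list.  Returns the five RI features."""
    essay = [w for s in essay_sents for w in s]
    prompt = [w for s in prompt_sents for w in s]
    sim = lambda a, b: group_similarity(a, b, context)
    return {
        "essay_vs_prompt": sim(essay, prompt),
        "best_sentence_vs_prompt": max(sim(s, prompt) for s in essay_sents),
        "essay_vs_best_prompt_sentence": max(sim(essay, p) for p in prompt_sents),
        "best_sentence_vs_best_prompt_sentence":
            max(sim(s, p) for s in essay_sents for p in prompt_sents),
        "essay_vs_rewritten_prompt": sim(essay, rewritten_prompt),
    }
```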


1. Thesis Clarity Keyword Features
• Thesis clarity refers to how clearly an author explains the thesis of her essay
• Introduced in Persing & Ng (2013) for scoring the thesis clarity of an essay
• Generated based on thesis clarity keywords

What are Thesis Clarity Keywords?
• "Important" words in a prompt
  – Important word: a good word for a student to use when stating her thesis about the prompt

How to Identify Keywords?
• By hand

Hand-Selecting Keywords
• Hand-segment each multi-part prompt into parts
• For each part, hand-pick the most important (primary) and second most important (secondary) words that it would be good for a writer to use to address the part

Example prompt: "The prison system is outdated. No civilized society should punish its criminals; it should rehabilitate them."
  – Primary: rehabilitate
  – Secondary: society

Designing Keyword Features
• Example: for one feature, we
  1. compute the random indexing similarity between the essay and each group of primary keywords taken from the parts of the essay's prompt
  2. assign the feature the lowest of these values
• A low feature value suggests that the student ignored the prompt component from which the value came (see the sketch below)
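A minimal sketch of this min-over-parts feature, reusing group_similarity from the RI sketch; primary_groups is a hypothetical list of hand-picked keyword sets, one per prompt part:

```python
def primary_keyword_feature(essay_tokens, primary_groups, context):
    # The feature is the *lowest* essay-to-keyword-group RI similarity:
    # a low value hints that some part of the prompt was ignored.
    return min(group_similarity(essay_tokens, group, context)
               for group in primary_groups)
```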

Thesis Clarity Keyword Features
• Though these features were designed for scoring thesis clarity, some of them are useful for prompt adherence scoring

2. Prompt Adherence Keyword Features
• Motivation: rather than relying on keywords for thesis clarity, why not hand-pick keywords for prompt adherence and create features from them?

An Illustrative Example
Prompt: "Marx once said that religion was the opium of the masses. If he was alive at the end of the 20th century, he would replace religion with television."
• This question suggests that students discuss whether television is analogous to religion in this way
  – The prompt adherence keywords contain "religion"
  – The thesis clarity keywords do not contain "religion"
  – A thesis like "Television is bad" can be stated clearly without reference to "religion"
• An essay with this thesis could have a high thesis clarity score
  – But a low prompt adherence score: religion should be discussed

Creating Features from Keywords
• Two types of features; for each prompt component:
  1. take the RI similarity between the whole essay and the component's keywords
  2. compute the fraction of the component's keywords that appear in the essay
(A sketch follows.)
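A sketch of both per-component feature types, again reusing group_similarity; components is a hypothetical mapping from a component id to its hand-picked prompt adherence keywords:

```python
def pa_keyword_features(essay_tokens, components, context):
    essay_vocab = set(essay_tokens)
    feats = {}
    for cid, keywords in components.items():
        # 1. RI similarity between the whole essay and the keywords
        feats[f"ri_sim_{cid}"] = group_similarity(essay_tokens, keywords, context)
        # 2. fraction of the component's keywords appearing in the essay
        feats[f"kw_frac_{cid}"] = len(essay_vocab & set(keywords)) / len(keywords)
    return feats
```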

3. LDA Topics
• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt

Prompt: "All armies should consist entirely of professional soldiers; there is no value in a system of military service."
• An essay containing words like "peace", "patriotism", or "training" is probably not digressing and should not be penalized for discussing these topics
  – But the various measures of keyword similarity might not notice that anything related to the prompt is discussed
  – This might have effects like lowering the RI similarity scores

How to Create LDA Features
For each prompt:
1. Collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. Build an LDA model of 1000 topics
   – A soft clustering of the words into 1000 sets
   – E.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"
   – The model can tell us how much of an essay is spent on each topic, e.g.: Topic 1: 25%, Topic 2: 45%, Topic 3: 15%, …, Topic 1000: 5%
3. Construct 1000 features, one for each topic
   – The feature value encodes how much of the essay was spent discussing the topic
   – E.g., an essay written for the military prompt might spend 55% of its content on the topic {"fully", "count", "ordinary", "czech", "day"} and 45% on the topic {"man", "military", "service", "pay", "war"}
(A sketch follows.)
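A sketch of these features, assuming gensim (the talk does not name an LDA toolkit); each essay's feature vector is its per-topic proportion under the prompt-specific model:

```python
from gensim import corpora
from gensim.models import LdaModel

def lda_features(tokenized_essays, num_topics=1000):
    """tokenized_essays: all ICLE essays for one prompt, as token lists.
    Returns the trained model and one num_topics-long vector per essay."""
    dictionary = corpora.Dictionary(tokenized_essays)
    bows = [dictionary.doc2bow(essay) for essay in tokenized_essays]
    lda = LdaModel(bows, id2word=dictionary, num_topics=num_topics)
    feature_vectors = []
    for bow in bows:
        vec = [0.0] * num_topics
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = prob  # share of the essay spent on this topic
        feature_vectors.append(vec)
    return lda, feature_vectors
```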

4. Manually Annotated LDA Topics
• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression
  – An infrequent topic may not appear often enough in the training set for the regressor to make this judgment
• So, create manually annotated LDA features

How to Create Manually Annotated LDA Features
1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
   – Higher score = more adherent
3. For each essay, create 10 features from the labeled topics

10 Features from the Labeled Topics
• Five features encode the sum of contributions to an essay of the topics annotated with the number 0, 1, …, 4, respectively
• Five features encode the sum of contributions to an essay of the topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussion (see the sketch below)
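A sketch of the ten features, assuming topic_shares comes from the LDA feature sketch above and topic_labels holds the hand-annotated 0–5 ratings:

```python
def annotated_lda_features(topic_shares, topic_labels):
    """topic_shares: {topic_id: share of the essay on that topic};
    topic_labels: {topic_id: hand-annotated 0-5 adherence rating}."""
    def mass(keep):
        return sum(share for t, share in topic_shares.items()
                   if keep(topic_labels[t]))
    feats = {}
    for k in range(5):       # exact labels 0, 1, ..., 4
        feats[f"label_eq_{k}"] = mass(lambda lab, k=k: lab == k)
    for k in range(1, 6):    # labels >= 1, >= 2, ..., >= 5
        feats[f"label_ge_{k}"] = mass(lambda lab, k=k: lab >= k)
    return feats
```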

5. Predicted Thesis Clarity Errors
• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

5 Types of Thesis Clarity Errors
• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

Features Based on Error Types
• Introduce features for prompt adherence scoring that encode the error types an essay contains
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• So, predict which of the 5 error types an essay contains
  – Recast as a multi-label classification task

Creating Predicted "Error Type" Features
• Add a binary feature indicating the presence or absence of each error type (a sketch follows)
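A sketch of the multi-label error predictor and the resulting binary features; the talk does not name the learner, so a one-vs-rest logistic regression stands in, and the feature names are ours:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

ERROR_TYPES = ["ConfusingPhrasing", "MissingDetails", "WriterPosition",
               "IncompletePromptResponse", "RelevanceToPrompt"]

def error_type_features(X_train, Y_train, X_test):
    """Y_train: n_train x 5 binary matrix of gold error annotations.
    Returns a dict of 5 binary features per test essay."""
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_train, Y_train)
    predictions = clf.predict(X_test)  # n_test x 5 binary matrix
    return [{f"err_{name}": int(flag)
             for name, flag in zip(ERROR_TYPES, row)}
            for row in predictions]
```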

6. N-gram Features
• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – Selected according to information gain (a sketch follows)
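A sketch of the n-gram selection, assuming scikit-learn; mutual_info_classif is an information-gain-style criterion over the discrete score labels and may differ in detail from the talk's exact computation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def top_ngram_features(lemmatized_essays, scores, k=10_000):
    """lemmatized_essays: strings of lemmatized tokens;
    scores: adherence scores, doubled into integer class labels."""
    y = [int(round(2 * s)) for s in scores]  # 1..4 in half steps -> 2..8
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    return vectorizer, selector, selector.transform(X)
```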

Summary of Features
• Baseline features
  – Random indexing features
• Six types of new features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

Plan for the Talk
• Corpus and Annotations ✓
• Model for scoring prompt adherence ✓
• Evaluation

Evaluation
• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

Scoring Metrics
• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei:
  – S1 (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from far misses
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between the Ai and the Ei
• S1, S2, S3 are error metrics: a smaller value is better
• PC is a correlation coefficient: a larger value is better
(The formulas are reconstructed below.)
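The formula images on these slides did not survive extraction; the following LaTeX reconstruction is consistent with the definitions above, where N is the number of essays and, for S1, Ê_i is the estimate snapped to the nearest of the seven legal scores (that rounding step is our reading of the definition):

```latex
S_1 = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\bigl[A_i \neq \hat{E}_i\bigr]
\qquad
S_2 = \frac{1}{N}\sum_{i=1}^{N}\bigl|A_i - E_i\bigr|
\qquad
S_3 = \frac{1}{N}\sum_{i=1}^{N}\bigl(A_i - E_i\bigr)^{2}
\qquad
PC = \frac{\operatorname{cov}(A,E)}{\sigma_A\,\sigma_E}
```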

Regressor Training
• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (a sketch follows)
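A minimal sketch of the tuning loop for one metric; the grid of C values is an assumption, and PC would be negated so that "smaller is better" holds uniformly:

```python
import numpy as np
from sklearn.svm import SVR

def s2_metric(gold, pred):
    # Average absolute error (the S2 metric above).
    return float(np.mean(np.abs(np.asarray(gold) - np.asarray(pred))))

def tune_regularization(X_tr, y_tr, X_dev, y_dev, metric=s2_metric):
    """Return the regressor whose C minimizes `metric` on dev data."""
    best_score, best_reg = None, None
    for C in (10.0 ** k for k in range(-3, 4)):  # assumed grid
        reg = SVR(kernel="linear", C=C).fit(X_tr, y_tr)
        score = metric(y_dev, reg.predict(X_dev))
        if best_score is None or score < best_score:
            best_score, best_reg = score, reg
    return best_reg
```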

Results

System       S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(Baseline = the RI features; S1 = probability of error, S2 = average absolute error, S3 = average squared error, PC = Pearson's correlation coefficient)
• Our system improves on the baseline w.r.t. all four scoring metrics
• The differences are significant

Feature Ablation
• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features

Feature Ablation Results
[Figure: for each scoring metric, the seven feature types ordered by how much their removal hurts performance. As best recoverable from the slide layout, each row runs from least to most important:
  S1: PA keywords, annotated LDA topics, N-grams, predicted TC errors, RI, unlabeled LDA topics, TC keywords
  S2: predicted TC errors, unlabeled LDA topics, PA keywords, RI, TC keywords, annotated LDA topics, N-grams
  S3: TC keywords, predicted TC errors, PA keywords, unlabeled LDA topics, RI, annotated LDA topics, N-grams
  PC: PA keywords, unlabeled LDA topics, predicted TC errors, RI, N-grams, TC keywords, annotated LDA topics]
• The relative importance of the features does not always remain consistent if we measure performance in different ways
• But there are features that tend to be more important than the others in the presence of the other features:
  – Most important: TC keywords, n-grams, annotated LDA topics
  – Middling importance: RI, unlabeled LDA topics
  – Least important: predicted TC errors, PA keywords

Summary
• Examined the problem of prompt adherence scoring in student essays
  – Feature-rich approach
• Released the annotations:
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 2: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

2

3

Automated Essay Scoring

bull Important educational application of NLP

bull Related research on essay scoring

ndash Grammatical errors

ndash Coherence

ndash Thesis clarity

ndash Organization

ndash Prompt adherence

3

4

What is Prompt Adherence

bull refers to how related an essayrsquos content is to

the prompt for which it was written

4

5

What is Prompt Adherence

bull refers to how related an essayrsquos content is to

the prompt for which it was written

ndash An essay with a high prompt adherence score

consistently remains on topic introduced by the

prompt and is free of irrelevant digressions

5

6

Goal

bull Develop a model for scoring the prompt

adherence of student essays

6

7

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

8

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

9

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

Knowledge-lean

ndash Features derived from semantic similarity measures

bull Random indexing (RI) and Content Vector Analysis (CVA)

10

What are the differences between

our work and previous work

10

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

Results

System        S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(Baseline: RI features. S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient.)

• Improvements wrt all four scoring metrics

• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance wrt each scoring metric

– Train a regressor on all but one type of features (a sketch of this loop follows below)

Feature Ablation Results

• Features ordered from least important (left) to most important (right) for each scoring metric (TC: thesis clarity; PA: prompt adherence; RI: random indexing):

S1: PA keywords | Annotated LDA topics | N-grams | Predicted TC errors | RI | Unlabeled LDA topics | TC keywords

S2: Predicted TC errors | Unlabeled LDA topics | PA keywords | RI | TC keywords | Annotated LDA topics | N-grams

S3: TC keywords | Predicted TC errors | PA keywords | Unlabeled LDA topics | RI | Annotated LDA topics | N-grams

PC: PA keywords | Unlabeled LDA topics | Predicted TC errors | RI | N-grams | TC keywords | Annotated LDA topics

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But there are features that tend to be more important than the others in the presence of other features:

– most important: TC keywords, n-grams, annotated LDA topics

– of middling importance: RI, unlabeled LDA topics

– least important: predicted TC errors, PA keywords
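Orderings like the ones above can be read off mechanically from the ablation loop's per-metric outputs; a small illustrative helper (assuming ablation_results from the sketch above):

    def importance_order(results, metric_index):
        # Order feature types from least to most important for one metric.
        # For the error metrics S1-S3, removing an important feature type makes
        # the dev-set error rise, so a larger ablated error means more important;
        # for PC the reverse holds (larger ablated correlation = less important).
        reverse = (metric_index == 3)
        return sorted(results, key=results.get, reverse=reverse)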

Summary

• Examined the problem of prompt adherence scoring in student essays

– feature-rich approach

• Released the annotations

– prompt adherence scores

– prompt adherence keywords

– manually annotated LDA topics

Page 3: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

3

Automated Essay Scoring

bull Important educational application of NLP

bull Related research on essay scoring

ndash Grammatical errors

ndash Coherence

ndash Thesis clarity

ndash Organization

ndash Prompt adherence

3

4

What is Prompt Adherence

bull refers to how related an essayrsquos content is to

the prompt for which it was written

4

5

What is Prompt Adherence

bull refers to how related an essayrsquos content is to

the prompt for which it was written

ndash An essay with a high prompt adherence score

consistently remains on topic introduced by the

prompt and is free of irrelevant digressions

5

6

Goal

bull Develop a model for scoring the prompt

adherence of student essays

6

7

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

8

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

9

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

Knowledge-lean

ndash Features derived from semantic similarity measures

bull Random indexing (RI) and Content Vector Analysis (CVA)

10

What are the differences between

our work and previous work

10

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 4: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

4

What is Prompt Adherence

bull refers to how related an essayrsquos content is to

the prompt for which it was written

4

5

What is Prompt Adherence

bull refers to how related an essayrsquos content is to

the prompt for which it was written

ndash An essay with a high prompt adherence score

consistently remains on topic introduced by the

prompt and is free of irrelevant digressions

5

6

Goal

bull Develop a model for scoring the prompt

adherence of student essays

6

7

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

8

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

9

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

Knowledge-lean

ndash Features derived from semantic similarity measures

bull Random indexing (RI) and Content Vector Analysis (CVA)

10

What are the differences between

our work and previous work

10

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression
  – an infrequent topic may not appear enough in the training set for the regressor to make this judgment

• Create manually annotated LDA features

67

How to create Manually Annotated LDA Features?

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
   • higher score = more adherent
3. For each essay, create 10 features from the labeled topics

71

10 Features from the Labeled Topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, respectively
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions

73
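A small sketch of how these ten aggregate features could be computed from an essay's 100-topic distribution and the hand-assigned 0–5 labels; the names are illustrative, not the authors' code.

```python
# Sketch: 10 aggregate features from hand-labeled LDA topics.
# topic_share[t] = fraction of the essay spent on topic t
# adherence[t]   = hand-assigned adherence label in {0, ..., 5} for topic t
def annotated_lda_features(topic_share, adherence):
    feats = []
    for label in range(5):       # contributions of topics labeled exactly 0..4
        feats.append(sum(s for t, s in enumerate(topic_share)
                         if adherence[t] == label))
    for thresh in range(1, 6):   # contributions of topics labeled >= 1 .. >= 5
        feats.append(sum(s for t, s in enumerate(topic_share)
                         if adherence[t] >= thresh))
    return feats  # 10 feature values
```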

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

76

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – though each essay was manually annotated with the errors it contains, in a realistic setting we won’t have access to these manual annotations

• Predict which of the 5 error types an essay contains
  – recast as a multi-label classification task

79

Creating Predicted “Error Type” Features

• Add a binary feature indicating the presence or absence of each error type

80
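One way to realize this step in code is a one-vs-rest sketch with scikit-learn, one binary classifier per error type; the choice of learner here is an assumption, not necessarily what was used in the original work.

```python
# Sketch: predicting the five thesis clarity error types as a
# multi-label task (scikit-learn assumed; learner choice illustrative).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

def predicted_error_features(X_train, Y_train, X_test):
    # Y_train: binary indicator matrix with one column per error type
    clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
    return clf.predict(X_test)  # one binary feature per error type
```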

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain

81
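A rough sketch of the n-gram feature selection, assuming scikit-learn: mutual information stands in for information gain, and mapping the half-point scores to discrete class labels is an assumption made so the selector applies.

```python
# Sketch: top-10K lemmatized 1-3 gram features chosen by an
# information-gain-style criterion (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ngram_features(lemmatized_essays, scores, k=10000):
    vec = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vec.fit_transform(lemmatized_essays)
    labels = [int(round(2 * s)) for s in scores]  # 1, 1.5, ..., 4 -> 7 classes
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1]))
    return selector.fit_transform(X, labels), vec, selector
```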

Summary of Features

• Baseline features
  – Random indexing features

• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

82

Plan for the Talk

Corpus and Annotations
Model for scoring prompt adherence
• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

84

Scoring Metrics

• Define 4 evaluation metrics, where Ai is the annotated score and Ei the estimated score of essay i, over N test essays:

  S1 = (1/N) Σi 1[Ei ≠ Ai]  (probability of error): the probability that a system predicts the wrong score out of 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)

  S2 = (1/N) Σi |Ai − Ei|  (average absolute error): distinguishes near misses from far misses

  S3 = (1/N) Σi (Ai − Ei)²  (average squared error): prefers systems whose estimations are not too often far away from the correct scores

  PC: Pearson’s correlation coefficient between the Ai and the Ei

• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better

94
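The four metrics are straightforward to compute; a sketch follows (numpy assumed). Rounding the estimate to the nearest half point before scoring S1 is an assumption about how a “wrong score” is decided.

```python
# Sketch: the four scoring metrics over annotated scores A and estimates E.
import numpy as np

def scoring_metrics(A, E):
    A, E = np.asarray(A, float), np.asarray(E, float)
    nearest = np.round(E * 2) / 2            # snap to the 7 legal scores
    s1 = np.mean(nearest != A)               # probability of error
    s2 = np.mean(np.abs(A - E))              # average absolute error
    s3 = np.mean((A - E) ** 2)               # average squared error
    pc = np.corrcoef(A, E)[0, 1]             # Pearson's correlation coefficient
    return s1, s2, s3, pc
```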

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data

95
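A sketch of that tuning loop, with scikit-learn's SVR (a LIBSVM wrapper) standing in for the original learner; the candidate values for the regularization parameter C are illustrative.

```python
# Sketch: tune C on development data for one scoring metric.
from sklearn.svm import SVR

def train_regressor(X_tr, y_tr, X_dev, y_dev, metric, smaller_is_better=True):
    best_model, best_score = None, None
    for C in (0.01, 0.1, 1.0, 10.0, 100.0):
        model = SVR(kernel="linear", C=C).fit(X_tr, y_tr)
        score = metric(y_dev, model.predict(X_dev))
        better = best_score is None or (
            score < best_score if smaller_is_better else score > best_score)
        if better:
            best_model, best_score = model, score
    return best_model
```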

Results

System                   S1     S2     S3     PC
Baseline (RI features)   .517   .368   .234   .233
Our system               .488   .348   .197   .360

• S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson’s correlation coefficient
• Improvements w.r.t. all four scoring metrics; the differences are significant

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system’s performance w.r.t. each scoring metric
  – train a regressor on all but one type of features

105
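A compact sketch of the ablation loop; `feature_groups` maps each feature type to its column indices and `train_and_eval` is a hypothetical helper (e.g., 5-fold cross-validation returning one scoring metric).

```python
# Sketch: retrain with one feature type held out at a time (numpy assumed).
import numpy as np

def ablation(X, y, feature_groups, train_and_eval):
    results = {}
    for name, cols in feature_groups.items():
        keep = np.setdiff1d(np.arange(X.shape[1]), cols)
        results[name] = train_and_eval(X[:, keep], y)
        # a larger performance drop => a more important feature type
    return results
```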

Feature Ablation Results

[Table: for each scoring metric (S1, S2, S3, PC), the seven feature types (PA keywords, TC keywords, n-grams, RI, unlabeled LDA topics, annotated LDA topics, predicted TC errors) ranked from most important to least important; the ordering differs from metric to metric]

• The relative importance of the features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

118

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach

• Released the annotations:
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

119

Page 5: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

5

What is Prompt Adherence

bull refers to how related an essayrsquos content is to

the prompt for which it was written

ndash An essay with a high prompt adherence score

consistently remains on topic introduced by the

prompt and is free of irrelevant digressions

5

6

Goal

bull Develop a model for scoring the prompt

adherence of student essays

6

7

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

8

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

9

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

Knowledge-lean

ndash Features derived from semantic similarity measures

bull Random indexing (RI) and Content Vector Analysis (CVA)

10

What are the differences between

our work and previous work

10

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 6: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

6

Goal

bull Develop a model for scoring the prompt

adherence of student essays

6

7

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

8

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

9

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

Knowledge-lean

ndash Features derived from semantic similarity measures

bull Random indexing (RI) and Content Vector Analysis (CVA)

10

What are the differences between

our work and previous work

10

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

"Marx once said that religion was the opium of the masses. If he was alive at the end of the 20th century, he would replace religion with television."

• This question suggests that students discuss whether television is analogous to religion in this way

– prompt adherence keywords contain "religion"

– thesis clarity keywords do not contain "religion"

– A thesis like "Television is bad" can be stated clearly without reference to "religion"

• An essay with this thesis could have a high thesis clarity score

– but a low prompt adherence score: religion should be discussed

50

Creating Features from Keywords

• Two types of features; for each prompt component:

1. take the RI similarity between the whole essay and the component's keywords

2. compute the fraction of the component's keywords that appear in the essay (sketch below)
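A sketch of both feature types under the same assumptions as before (ri_similarity as above; component_keywords is a hypothetical structure with one keyword list per prompt component).

```python
def pa_keyword_features(essay_words, component_keywords, ri_vectors):
    """Two features per prompt component: RI similarity between the essay
    and the component's keywords, and the fraction of those keywords
    that appear in the essay."""
    essay_set = set(essay_words)
    feats = []
    for keywords in component_keywords:
        feats.append(ri_similarity(essay_words, keywords, ri_vectors))
        feats.append(sum(w in essay_set for w in keywords) / len(keywords))
    return feats
```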

52

3. LDA Topics

• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt

"All armies should consist entirely of professional soldiers; there is no value in a system of military service."

• An essay containing words like "peace", "patriotism", or "training" is probably not digressing and should not be penalized for discussing these topics

– But the various measures of keyword similarities might not notice that anything related to the prompt is discussed

– this might have effects like lowering the RI similarity scores

57

How to create LDA features

For each prompt (see the sketch below):

1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled

2. build an LDA model of 1000 topics

– Soft clustering of the words into 1000 sets

– E.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"

– The model can tell us how much of an essay is spent on each topic, e.g. 25% Topic 1, 45% Topic 2, 15% Topic 3, ……, 5% Topic 1000

3. construct 1000 features, one for each topic

• Feature value encodes how much of the essay was spent discussing the topic

– E.g., an essay written for the military prompt might spend 55% on ("fully", "count", "ordinary", "czech", "day") and 45% on ("man", "military", "service", "pay", "war")
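A sketch of these topic-distribution features using gensim, one possible toolkit (the talk does not name one); essays is assumed to be the list of tokenized ICLE essays written for one prompt.

```python
from gensim import corpora, models

def lda_topic_features(essays, num_topics=1000):
    """essays: list of token lists, one per ICLE essay for a single prompt.
    Returns one num_topics-dimensional vector per essay, where each value
    is the proportion of the essay spent on that topic."""
    dictionary = corpora.Dictionary(essays)
    corpus = [dictionary.doc2bow(e) for e in essays]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    features = []
    for bow in corpus:
        vec = [0.0] * num_topics
        # minimum_probability=0.0 makes gensim report every topic
        for topic_id, prop in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = prop
        features.append(vec)
    return features
```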

64

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression

– An infrequent topic may not appear enough in the training set for the regressor to make this judgment

• Create manually annotated LDA features

67

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics

2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt

• higher score, more adherent

3. For each essay, create 10 features from the labeled topics

71

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, resp.

• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, resp. (sketch below)

• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions
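A sketch of these ten features; topic_props would come from a 100-topic model like the one above, and topic_labels are the hand-assigned 0-5 adherence scores.

```python
def annotated_lda_features(topic_props, topic_labels):
    """topic_props: contribution of each of the 100 topics to one essay.
    topic_labels: hand-assigned adherence score (0-5) for each topic.
    Returns 10 features: mass on topics labeled exactly 0..4, and mass
    on topics labeled >= 1 .. >= 5."""
    exact = [sum(p for p, l in zip(topic_props, topic_labels) if l == k)
             for k in range(5)]
    at_least = [sum(p for p, l in zip(topic_props, topic_labels) if l >= k)
                for k in range(1, 6)]
    return exact + at_least
```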

73

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we

– score an essay w.r.t. the clarity of its thesis

– determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

• Confusing Phrasing

• Missing Details

• Writer Position

• Incomplete Prompt Response

• Relevance to Prompt

76

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains

– Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations

• Predict which of the 5 error types an essay contains

– Recast as a multi-label classification task

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type (sketch below)
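One way to realize this multi-label prediction is one binary classifier per error type; this is an assumption, as the talk does not specify the learner. A scikit-learn sketch:

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

def predicted_error_features(X_train, Y_train, X_test):
    """Y_train: binary indicator matrix (n_essays x 5 error types).
    Returns 5 binary features per test essay, one per predicted
    error type."""
    clf = OneVsRestClassifier(LinearSVC())
    clf.fit(X_train, Y_train)
    return clf.predict(X_test)  # 0/1 matrix: presence of each error type
```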

80

6. N-gram Features

• Can capture useful words and phrases related to a prompt

• 10K lemmatized unigrams, bigrams, and trigrams

– selected according to information gain (sketch below)
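A sketch of the selection step with scikit-learn, using mutual information as a stand-in for the information gain criterion on the slide; the helper name is hypothetical and essays are assumed already lemmatized.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ngram_features(lemmatized_essays, scores, k=10000):
    """lemmatized_essays: one whitespace-joined string of lemmas per essay.
    Keeps the k uni-/bi-/trigrams most informative about the score."""
    labels = [int(2 * s) for s in scores]  # 7 half-point scores -> discrete labels
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1]))
    return selector.fit_transform(X, labels)
```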

81

Summary of Features

• Baseline features

– Random indexing features

• Six types of new features

– Lemmatized n-grams

– Thesis clarity keyword features

– Prompt adherence keyword features

– LDA topics

– Manually annotated LDA topics

– Predicted thesis clarity errors

82

Plan for the Talk

• Corpus and Annotations

• Model for scoring prompt adherence

• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring

• 5-fold cross-validation

84

Scoring Metrics

• Define 4 evaluation metrics, comparing annotated scores A_i with estimated scores E_i over the N test essays (sketch below):

S1 = (1/N) Σ_i 1[A_i ≠ E_i]  (probability of error): probability that a system predicts the wrong score out of 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)

S2 = (1/N) Σ_i |A_i − E_i|  (average absolute error): distinguishes near misses from far misses

S3 = (1/N) Σ_i (A_i − E_i)²  (average squared error): prefers systems whose estimations are not too often far away from the correct scores

PC: Pearson's correlation coefficient between A_i and E_i

• S1, S2, S3: error metrics; smaller value is better

• PC: correlation coefficient; larger value is better
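A short sketch of the four metrics as reconstructed above; whether S2 and S3 are computed on rounded or raw estimates is an assumption here.

```python
import numpy as np
from scipy.stats import pearsonr

def scoring_metrics(A, E):
    """A: annotated scores; E: estimated scores."""
    A = np.asarray(A, dtype=float)
    E = np.asarray(E, dtype=float)
    # Round estimates to the nearest of the 7 valid scores for S1
    R = np.clip(np.round(E * 2) / 2, 1.0, 4.0)
    S1 = np.mean(A != R)           # probability of error
    S2 = np.mean(np.abs(A - E))    # average absolute error
    S3 = np.mean((A - E) ** 2)     # average squared error
    PC = pearsonr(A, E)[0]         # Pearson's correlation coefficient
    return S1, S2, S3, PC
```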

94

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (sketch below)
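A sketch of the tuning loop with scikit-learn's SVR (a LIBSVM wrapper); the C grid is hypothetical. Pass a single-metric function, e.g. lambda a, e: scoring_metrics(a, e)[1] for S2, and set smaller_is_better=False when tuning for PC.

```python
from sklearn.svm import SVR  # scikit-learn wraps LIBSVM

def tune_regressor(X_tr, y_tr, X_dev, y_dev, metric, smaller_is_better=True):
    """Pick the regularization parameter C that optimizes one scoring
    metric on held-out development data."""
    best_model, best_score = None, None
    for C in (0.01, 0.1, 1.0, 10.0, 100.0):  # hypothetical grid
        model = SVR(kernel="linear", C=C).fit(X_tr, y_tr)
        score = metric(y_dev, model.predict(X_dev))
        better = best_score is None or (
            score < best_score if smaller_is_better else score > best_score)
        if better:
            best_model, best_score = model, score
    return best_model
```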

95

Results

System        S1      S2      S3      PC
Baseline     .517    .368    .234    .233
Our system   .488    .348    .197    .360

(Baseline = RI features; S1 = probability of error, S2 = average absolute error, S3 = average squared error, PC = Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics

• Differences are significant

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric

– Train a regressor on all but one type of features (see the sketch below)
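A sketch of the ablation loop; train_and_score is a hypothetical callable wrapping regressor training and the chosen scoring metric (e.g. via cross-validation). Comparing each result with the full system's score shows how important the held-out feature type is: the bigger the degradation, the more important the type.

```python
import numpy as np

def feature_ablation(feature_groups, train_and_score):
    """feature_groups: dict mapping a feature-type name to its matrix.
    Returns the metric obtained when each type is held out in turn."""
    results = {}
    for held_out in feature_groups:
        X = np.hstack([M for name, M in feature_groups.items()
                       if name != held_out])  # all but one feature type
        results[held_out] = train_and_score(X)
    return results
```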

105

Feature Ablation Results

[Figure: for each scoring metric, the seven feature types arranged from least important (left) to most important (right) when ablated]

S1: PA keywords, Annotated LDA topics, N-grams, Predicted TC errors, RI, Unlabeled LDA topics, TC keywords
S2: Predicted TC errors, Unlabeled LDA topics, PA keywords, RI, TC keywords, Annotated LDA topics, N-grams
S3: TC keywords, Predicted TC errors, PA keywords, Unlabeled LDA topics, RI, Annotated LDA topics, N-grams
PC: PA keywords, Unlabeled LDA topics, Predicted TC errors, RI, N-grams, TC keywords, Annotated LDA topics

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But… there are features that tend to be more important than the others in the presence of other features

• most important: TC keywords, n-grams, annotated LDA topics

• middling importance: RI, unlabeled LDA topics

• least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays

– feature-rich approach

• Released the annotations

– prompt adherence scores

– prompt adherence keywords

– manually annotated LDA topics


36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 9: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

9

Related Work on Prompt Adherence Scoring

bull Off-topic sentence detection (Higgins et al 2004)

bull Off-topic essay detection (Higgins et al 2006)

bull Off-topic essay detection with short prompts

(Louis and Higgins 2010)

Binary decision (off-topicon-topic)

Knowledge-lean

ndash Features derived from semantic similarity measures

bull Random indexing (RI) and Content Vector Analysis (CVA)

10

What are the differences between

our work and previous work

10

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

[Chart: the seven feature types ranked by importance wrt each scoring metric; each row runs from least important (left) to most important (right), per the chart's arrows]

S1: PA keywords, Annotated LDA topics, N-grams, Predicted TC errors, RI, Unlabeled LDA topics, TC keywords

S2: Predicted TC errors, Unlabeled LDA topics, PA keywords, RI, TC keywords, Annotated LDA topics, N-grams

S3: TC keywords, Predicted TC errors, PA keywords, Unlabeled LDA topics, RI, Annotated LDA topics, N-grams

PC: PA keywords, Unlabeled LDA topics, Predicted TC errors, RI, N-grams, TC keywords, Annotated LDA topics

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But… there are features that tend to be more important than the others in the presence of other features:

– most important: TC keywords, n-grams, annotated LDA topics

– middling important: RI, unlabeled LDA topics

– least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays

– feature-rich approach

• Released the annotations

– prompt adherence scores

– prompt adherence keywords

– manually annotated LDA topics

Page 10: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

10

What are the differences between

our work and previous work

10

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 11: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

11

What are the differences between

our work and previous work

11

Task

Score can range from 1-4 points

Supervised prompt adherence scoring

Approach

Feature-rich

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

Results

System        S1     S2     S3     PC
Baseline      .517   .368   .234   .233
Our system    .488   .348   .197   .360

• Baseline: the RI features
• S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient
• Improvements w.r.t. all four scoring metrics
• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
– Train a regressor on all but one type of features (a minimal sketch of this protocol follows)
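A minimal sketch of the ablation protocol; the per-type feature matrices, train_fn, and metric are hypothetical stand-ins (train_fn could wrap the tune_svr routine above), not names from the talk:

import numpy as np

FEATURE_TYPES = ["RI", "n-grams", "TC keywords", "PA keywords",
                 "LDA topics", "annotated LDA topics", "predicted TC errors"]

def ablation(train_feats, y_train, dev_feats, y_dev, train_fn, metric):
    """Drop one feature type at a time and measure dev-set error; a larger
    error after removal means the removed feature type mattered more."""
    impact = {}
    for held_out in FEATURE_TYPES:
        kept = [t for t in FEATURE_TYPES if t != held_out]
        X_tr = np.hstack([train_feats[t] for t in kept])  # per-type matrices
        X_dv = np.hstack([dev_feats[t] for t in kept])
        model = train_fn(X_tr, y_train)  # returns a fitted regressor
        impact[held_out] = metric(y_dev, model.predict(X_dv))
    return impact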

Feature Ablation Results

• One row per scoring metric; within each row the feature types are ordered from least important (left) to most important (right), following the slide's most/least-important callouts:

S1: PA keywords, Annotated LDA topics, N-grams, Predicted TC errors, RI, Unlabeled LDA topics, TC keywords
S2: Predicted TC errors, Unlabeled LDA topics, PA keywords, RI, TC keywords, Annotated LDA topics, N-grams
S3: TC keywords, Predicted TC errors, PA keywords, Unlabeled LDA topics, RI, Annotated LDA topics, N-grams
PC: PA keywords, Unlabeled LDA topics, Predicted TC errors, RI, N-grams, TC keywords, Annotated LDA topics

• The relative importance of the feature types does not always remain consistent if we measure performance in different ways
• But… there are feature types that tend to be more important than the others in the presence of the other features:
– most important: TC keywords, n-grams, annotated LDA topics
– middling importance: RI, unlabeled LDA topics
– least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
– feature-rich approach
• Released the annotations:
– prompt adherence scores
– prompt adherence keywords
– manually annotated LDA topics

Page 12: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

12

bull Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

Plan for the Talk

12

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 13: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

13

Plan for the Talk

Corpus and Annotations

bull Approach for scoring prompt adherence

bull Evaluation

13

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

82

Plan for the Talk

Corpus and Annotations
Model for scoring prompt adherence
• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

84

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei (for N essays):

  S1 = (1/N) Σi 1[Ai ≠ Ei′]   (probability of error)
  – probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4); Ei′ is Ei rounded to the nearest possible score

  S2 = (1/N) Σi |Ai − Ei|   (average absolute error)
  – distinguishes near misses from far misses

  S3 = (1/N) Σi (Ai − Ei)²   (average squared error)
  – prefers systems whose estimations are not too often far away from the correct scores

  PC = Pearson's correlation coefficient between Ai and Ei

• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better

94
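A sketch of the four metrics as defined above, with `A` and `E` parallel sequences of annotated and estimated scores; rounding E to the nearest of the seven scores for S1 is an assumption consistent with the slide's "wrong score out of 7" description:

    import numpy as np

    SCORES = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])

    def scoring_metrics(A, E):
        A, E = np.asarray(A, float), np.asarray(E, float)
        E_rounded = SCORES[np.abs(E[:, None] - SCORES).argmin(axis=1)]
        S1 = np.mean(A != E_rounded)        # probability of error
        S2 = np.mean(np.abs(A - E))         # average absolute error
        S3 = np.mean((A - E) ** 2)          # average squared error
        PC = np.corrcoef(A, E)[0, 1]        # Pearson's correlation coefficient
        return S1, S2, S3, PC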

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data

95
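A sketch of the tuning loop, using scikit-learn's SVR (a wrapper around LIBSVM's epsilon-SVR); the candidate C grid and all variable names are assumptions:

    from sklearn.svm import SVR

    def tune_regressor(X_tr, y_tr, X_dev, y_dev, metric, smaller_is_better=True):
        best_model, best_val = None, None
        for C in [0.1, 1, 10, 100, 1000]:  # assumed candidate grid
            model = SVR(kernel="linear", C=C).fit(X_tr, y_tr)
            val = metric(y_dev, model.predict(X_dev))
            better = best_val is None or (val < best_val if smaller_is_better
                                          else val > best_val)
            if better:
                best_model, best_val = model, val
        return best_model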

Results

System        S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(Baseline: random indexing features; S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• Significant differences

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features

105
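A sketch of the ablation loop, assuming a dense feature matrix `X`, scores `y`, a `feature_groups` map from feature-type name to column indices, and a `train_eval` callback that runs the 5-fold experiment (all names hypothetical):

    import numpy as np

    def ablation(X, y, feature_groups, train_eval):
        results = {}
        for name, cols in feature_groups.items():
            keep = np.setdiff1d(np.arange(X.shape[1]), cols)
            # Retrain and re-evaluate with this feature type held out.
            results[name] = train_eval(X[:, keep], y)
        return results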

Feature Ablation Results

Feature types ranked for each scoring metric, ordered from least important (left) to most important (right):

S1: PA keywords, annotated LDA topics, n-grams, predicted TC errors, RI, unlabeled LDA topics, TC keywords
S2: predicted TC errors, unlabeled LDA topics, PA keywords, RI, TC keywords, annotated LDA topics, n-grams
S3: TC keywords, predicted TC errors, PA keywords, unlabeled LDA topics, RI, annotated LDA topics, n-grams
PC: PA keywords, unlabeled LDA topics, predicted TC errors, RI, n-grams, TC keywords, annotated LDA topics

(TC: thesis clarity; PA: prompt adherence; RI: random indexing)

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

118

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

119

Page 14: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

14

Selecting a Corpus

bull International Corpus of Learner English (ICLE)

ndash 45 million words in more than 6000 essays

ndash Written by university undergraduates who are

learners of English as a foreign language

ndash Mostly (91) argumentative writing topics

bull Essays selected for annotation

ndash 830 argumentative essays from 13 prompts

ndash annotate each essay with its prompt adherence score

14

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 15: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

15

Prompt Adherence Scoring Rubric

  4 ndash essay fully addresses the prompt and

consistently stays on topic

  3 ndash essay mostly addresses the prompt or

occasionally wanders off topic

  2 ndash essay does not fully address the prompt or

consistently wanders off topic

  1 ndash essay does not address the prompt at all or is

completely off topic

bull Half-point increments (ie 15 25 35) allowed15

16

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

16

17

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

17

18

Inter-Annotator Agreement

bull 707 of 830 essays scored by both annotators

bull Perfect agreement on 38 of essays

bull Scores within 05 points on 66 of essays

bull Scores within 10 point on 89 of essays

bull Whenever annotators disagree use the average

score rounded to the nearest half point

18

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an
   LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate
   it with a number from 0 to 5 representing how likely it is
   that the topic is adherent to the prompt
   • higher score = more adherent
3. For each essay, create 10 features from the labeled topics

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay
  of topics annotated with the number 0, 1, ..., 4, respectively
• Five features encode the sum of contributions to an essay
  of topics annotated with a number ≥ 1, ≥ 2, ..., ≥ 5, respectively
• These features should give the regressor a better idea of
  how much of an essay is composed of prompt-related vs.
  prompt-unrelated discussion (see the sketch below)
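A small sketch of this aggregation, directly mirroring the two bullet
points above. The names are illustrative: topic_props holds one essay's
100 topic proportions, labels the hand-assigned 0-5 adherence numbers:

    def annotated_lda_features(topic_props, labels):
        # topic_props[t]: share of the essay devoted to topic t
        # labels[t]: hand-assigned prompt-adherence number (0-5) for topic t
        exact = [sum(p for p, l in zip(topic_props, labels) if l == k)
                 for k in range(5)]        # topics labeled 0, 1, ..., 4
        at_least = [sum(p for p, l in zip(topic_props, labels) if l >= k)
                    for k in range(1, 6)]  # topics labeled >= 1, ..., >= 5
        return exact + at_least            # the 10 feature values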

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring
  (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that
    detract from the clarity of its thesis

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

Features based on Error Types

• Introduced features for prompt adherence scoring
  that encode the error types an essay contains
  – though each essay was manually annotated with the
    errors it contains, in a realistic setting we won't have
    access to these manual annotations
• Predict which of the 5 error types an essay contains
  – recast as a multi-label classification task

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or
  absence of each error type
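One plausible realization of this prediction step, shown after the
slide for concreteness. The talk does not specify the classifier, so
the one-vs-rest setup, the scikit-learn usage, and the data names
below are assumptions:

    # Hypothetical sketch: predict the 5 thesis clarity error types and
    # expose each prediction as a binary feature (scikit-learn assumed)
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
                   "Incomplete Prompt Response", "Relevance to Prompt"]

    clf = OneVsRestClassifier(LinearSVC())
    clf.fit(X_train, Y_train)             # Y_train: n_essays x 5 binary indicator matrix
    error_features = clf.predict(X_test)  # one 0/1 feature per error type per essay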

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain
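A sketch of this selection step. It assumes information gain is
computed as the mutual information between an n-gram's presence and
the essay's score class (a common reading), with scikit-learn standing
in for whatever tooling was actually used and lemmatization done
beforehand:

    # Hypothetical sketch: keep the 10K lemmatized 1-3 grams with the
    # highest information gain w.r.t. the essay scores
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)   # n-gram presence matrix
    selector = SelectKBest(mutual_info_classif, k=10000)
    X_selected = selector.fit_transform(X, essay_score_labels)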

Summary of Features

• Baseline features
  – random indexing features
• Six types of new features
  – lemmatized n-grams
  – thesis clarity keyword features
  – prompt adherence keyword features
  – LDA topics
  – manually annotated LDA topics
  – predicted thesis clarity errors

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and
  the estimated scores Ei

S1: probability that a system predicts the wrong score, out of
    the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
    (probability of error)
S2: average absolute error
    – distinguishes near misses from far misses
S3: average squared error
    – prefers systems whose estimations are not too often far
      away from the correct scores
PC: Pearson's correlation coefficient between Ai and Ei

• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better
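The slide's formula images did not survive extraction; the following
LaTeX reconstruction is an assumption consistent with the descriptions
above, for N test essays and with E'_j denoting E_j rounded to the
nearest of the 7 possible scores:

    S_1 = \frac{1}{N} \sum_{j=1}^{N} \mathbf{1}\left[ A_j \neq E'_j \right]
    \qquad
    S_2 = \frac{1}{N} \sum_{j=1}^{N} \left| A_j - E_j \right|
    \qquad
    S_3 = \frac{1}{N} \sum_{j=1}^{N} \left( A_j - E_j \right)^2

with PC the Pearson correlation between the vectors A and E.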

Regressor Training

• SVM regressors are trained to maximize performance
  w.r.t. each scoring metric by tuning the regularization
  parameter on held-out development data (sketch below)
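A compact sketch of what this tuning loop might look like. The talk
trains with LIBSVM, so the scikit-learn SVR wrapper, the candidate C
grid, and the helper names here are assumptions:

    # Hypothetical sketch: choose the regularization parameter C that
    # optimizes one scoring metric on held-out development data
    from sklearn.svm import SVR

    def train_regressor(X_tr, y_tr, X_dev, y_dev, metric, smaller_is_better=True):
        best_value, best_c, best_model = None, None, None
        for c in [0.01, 0.1, 1, 10, 100]:  # assumed grid
            model = SVR(kernel="linear", C=c).fit(X_tr, y_tr)
            value = metric(y_dev, model.predict(X_dev))
            improved = (best_value is None or
                        (value < best_value if smaller_is_better
                         else value > best_value))
            if improved:
                best_value, best_c, best_model = value, c, model
        return best_model, best_c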

Results

System        S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(S1: probability of error; S2: average absolute error; S3: average
squared error; PC: Pearson's correlation coefficient; the baseline
uses the RI features)

• Improvements w.r.t. all four scoring metrics
• The differences are significant

Feature Ablation

• Goal: examine how much impact each of the feature
  types has on our system's performance w.r.t. each
  scoring metric
  – train a regressor on all but one type of features
    (a sketch of this leave-one-out loop follows the slide)
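Under the same assumptions as the earlier training sketch, the
ablation loop could look like this; build_features, the s1_error
metric helper, and the feature-group strings are illustrative:

    # Hypothetical sketch: leave-one-feature-type-out ablation
    FEATURE_TYPES = ["RI", "n-grams", "TC keywords", "PA keywords",
                     "LDA topics", "annotated LDA topics", "predicted TC errors"]

    for held_out in FEATURE_TYPES:
        kept = [ft for ft in FEATURE_TYPES if ft != held_out]
        X_tr = build_features(train_essays, kept)    # assumed helper
        X_dev = build_features(dev_essays, kept)
        model, _ = train_regressor(X_tr, y_tr, X_dev, y_dev, metric=s1_error)
        print(held_out, "ablated ->", s1_error(y_dev, model.predict(X_dev)))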

Feature Ablation Results

For each scoring metric, the feature types ordered from least
important (left) to most important (right) when ablated
(TC = thesis clarity, PA = prompt adherence):

S1: PA keywords < annotated LDA topics < n-grams < predicted TC
    errors < RI < unlabeled LDA topics < TC keywords
S2: predicted TC errors < unlabeled LDA topics < PA keywords < RI
    < TC keywords < annotated LDA topics < n-grams
S3: TC keywords < predicted TC errors < PA keywords < unlabeled
    LDA topics < RI < annotated LDA topics < n-grams
PC: PA keywords < unlabeled LDA topics < predicted TC errors < RI
    < n-grams < TC keywords < annotated LDA topics

• Relative importance of features does not always remain
  consistent if we measure performance in different ways
• But... there are features that tend to be more important than
  the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence
  scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics


detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 18: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

18

Inter-Annotator Agreement

• 707 of 830 essays scored by both annotators
• Perfect agreement on 38% of essays
• Scores within 0.5 points on 66% of essays
• Scores within 1.0 point on 89% of essays
• Whenever the annotators disagree, use the average score rounded to the nearest half point

19

Plan for the Talk

• Corpus and Annotations
• Approach for scoring prompt adherence
• Evaluation

22

Approach for Scoring Prompt Adherence

• Recast the problem as a linear regression task
• Train one regressor per prompt
  – common problems students have writing essays for one prompt may not apply to essays written for another
• One training instance per training essay
  – "class" value: prompt adherence score
  – learner: LIBSVM
  – features: 7 feature types

24

Features

• 7 types of features
  – baseline features
  – 6 types of new features

26

Baseline Features

• Features based on Random Indexing (RI)
  – adapted from Higgins et al. (2004)
• Random indexing
  – "an efficient and scalable alternative to LSI" (Sahlgren, 2005)
  – generates a semantic similarity measure between any two words
  – generalized to computing similarity between two groups of words (Higgins & Burstein, 2007)

27

Why Random Indexing (RI)?

• May help find text in an essay related to the prompt even if some of its words have been rephrased
  – e.g., the essay talks about "jail" while the prompt has "prison"
• Train an RI model on the English Gigaword corpus

28

5 Random Indexing Features

• The entire essay's similarity to the prompt
• The essay's highest individual sentence's similarity to the prompt
• The highest entire-essay similarity to one of the prompt sentences
• The highest individual sentence similarity to an individual prompt sentence
• The essay's similarity to a manually rewritten version of the prompt that excludes extraneous material

(a code sketch of these five features follows)
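The talk does not give an implementation, so below is a minimal sketch of how these five features could be computed. It assumes a trained RI model is available as a `word_vec` dictionary mapping words to vectors; the dictionary, tokenization, and helper names are hypothetical, not part of any named library.

```python
import numpy as np

def group_vector(words, word_vec):
    # Sum the vectors of the words found in the (assumed) RI model.
    vecs = [word_vec[w] for w in words if w in word_vec]
    return np.sum(vecs, axis=0) if vecs else None

def similarity(words_a, words_b, word_vec):
    # Cosine similarity between two groups of words, in the spirit of
    # Higgins & Burstein (2007); 0.0 if either group is out of vocabulary.
    a, b = group_vector(words_a, word_vec), group_vector(words_b, word_vec)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ri_features(essay_sents, prompt_sents, rewritten_prompt, word_vec):
    # essay_sents / prompt_sents: lists of token lists; rewritten_prompt: token list.
    essay = [w for s in essay_sents for w in s]
    prompt = [w for s in prompt_sents for w in s]
    return [
        similarity(essay, prompt, word_vec),                        # essay vs. prompt
        max(similarity(s, prompt, word_vec) for s in essay_sents),  # best essay sentence vs. prompt
        max(similarity(essay, p, word_vec) for p in prompt_sents),  # essay vs. best prompt sentence
        max(similarity(s, p, word_vec)                              # best sentence pair
            for s in essay_sents for p in prompt_sents),
        similarity(essay, rewritten_prompt, word_vec),              # essay vs. rewritten prompt
    ]
```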

29

Features

• 7 types of features
  – baseline features
  – 6 types of new features

32

1. Thesis Clarity Keyword Features

(thesis clarity refers to how clearly an author explains the thesis of her essay)

• introduced in Persing & Ng (2013) for scoring the thesis clarity of an essay
• generated based on thesis clarity keywords

34

What are Thesis Clarity Keywords?

• "important" words in a prompt
  – important word = a good word for a student to use when stating her thesis about the prompt

36

How to Identify Keywords?

• By hand

39

Hand-Selecting Keywords

• Hand-segment each multi-part prompt into parts
• For each part, hand-pick the most important (primary) and second most important (secondary) words that it would be good for a writer to use to address the part

  "The prison system is outdated. No civilized society should punish its criminals; it should rehabilitate them."

  Primary: rehabilitate
  Secondary: society

41

Designing Keyword Features

• Example: in one feature we
  1. compute the random indexing similarity between the essay and each group of primary keywords taken from parts of the essay's prompt
  2. assign the feature the lowest of these values
• A low feature value suggests that the student ignored the prompt component from which the value came (see the sketch below)
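A sketch of the example feature just described, under the same assumptions as the earlier RI sketch (precomputed essay and keyword-group vectors; `cos` is an ordinary cosine similarity written out here, not a library call):

```python
import numpy as np

def cos(a, b):
    # Plain cosine similarity; the epsilon guards against zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def min_primary_similarity(essay_vec, primary_group_vecs):
    # Feature value = the lowest RI similarity between the essay and any prompt
    # part's primary-keyword group; a low value flags an ignored component.
    return min(cos(essay_vec, g) for g in primary_group_vecs)
```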

42

Thesis Clarity Keyword Features

• Though these features were designed for scoring thesis clarity, some of them are useful for prompt adherence scoring

44

2. Prompt Adherence Keyword Features

• Motivation: rather than relying on keywords for thesis clarity, why not hand-pick keywords for prompt adherence and create features from them?

49

An Illustrative Example

  "Marx once said that religion was the opium of the masses. If he was alive at the end of the 20th century, he would replace religion with television."

• This question suggests that students discuss whether television is analogous to religion in this way
  – prompt adherence keywords contain "religion"
  – thesis clarity keywords do not contain "religion"
  – a thesis like "Television is bad" can be stated clearly without reference to "religion"
• An essay with this thesis could have a high thesis clarity score
  – but a low prompt adherence score: religion should be discussed

51

Creating Features from Keywords

• Two types of features
• For each prompt component:
  1. take the RI similarity between the whole essay and the component's keywords
  2. compute the fraction of the component's keywords that appear in the essay

(a code sketch follows)
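A hedged sketch of the two per-component keyword features, again assuming a `word_vec` dictionary from a trained RI model and token lists as input (all names are hypothetical):

```python
import numpy as np

def group_vec(words, word_vec):
    # Sum the vectors of in-vocabulary words; zero vector if none are found.
    vecs = [word_vec[w] for w in words if w in word_vec]
    dim = next(iter(word_vec.values())).shape
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

def keyword_features(essay_tokens, component_keywords, word_vec):
    # For each prompt component: (1) RI similarity between the whole essay and
    # the component's keywords, (2) fraction of those keywords in the essay.
    e = group_vec(essay_tokens, word_vec)
    essay_set = set(essay_tokens)
    feats = []
    for keywords in component_keywords:
        k = group_vec(keywords, word_vec)
        denom = np.linalg.norm(e) * np.linalg.norm(k) + 1e-12
        feats.append(float(np.dot(e, k) / denom))                     # feature type 1
        feats.append(len(essay_set & set(keywords)) / len(keywords))  # feature type 2
    return feats
```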

56

3. LDA Topics

• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt

  "All armies should consist entirely of professional soldiers; there is no value in a system of military service."

• Words like "peace", "patriotism", or "training" in an essay are probably not digressions, and the essay should not be penalized for discussing these topics
  – but the various measures of keyword similarity might not notice that anything related to the prompt is discussed
  – this might have effects like lowering the RI similarity scores

63

How to create LDA features

For each prompt:
1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. build an LDA model of 1000 topics
   – a soft clustering of the words into 1000 sets
   – e.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"
   – the model can tell us how much of an essay is spent on each topic (e.g., 25% on Topic 1, 45% on Topic 2, 15% on Topic 3, ..., 5% on Topic 1000)
3. construct 1000 features, one for each topic
   – the feature value encodes how much of the essay was spent discussing the topic
   – e.g., an essay written for the military prompt might spend 55% on the topic "fully", "count", "ordinary", "czech", "day" and 45% on the topic "man", "military", "service", "pay", "war"

(a gensim sketch follows)
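The slides do not name a toolkit; below is a minimal sketch with gensim (an assumption) that builds the per-prompt 1000-topic model and turns each essay's topic distribution into the 1000 feature values:

```python
from gensim import corpora, models

def lda_topic_features(tokenized_essays, num_topics=1000):
    # Fit LDA on all essays written for one prompt and return, for each essay,
    # a num_topics-dimensional vector of topic proportions.
    dictionary = corpora.Dictionary(tokenized_essays)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_essays]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    features = []
    for bow in corpus:
        vec = [0.0] * num_topics
        # minimum_probability=0.0 makes gensim report every topic's share
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = float(prob)
        features.append(vec)
    return features
```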

66

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression
  – an infrequent topic may not appear enough in the training set for the regressor to make this judgment
• Create manually annotated LDA features

70

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
   – higher score = more adherent
3. For each essay, create 10 features from the labeled topics

72

10 Features from the Labeled Topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, ..., 4, respectively
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, ..., ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions

(sketch below)
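A sketch of the 10 features, assuming we already have each essay's 100-dimensional topic-proportion vector and the hand-assigned 0-5 adherence labels for the topics:

```python
import numpy as np

def annotated_lda_features(topic_props, topic_labels):
    # topic_props: length-100 vector of one essay's topic proportions.
    # topic_labels: length-100 array of hand-assigned adherence scores (0-5).
    p = np.asarray(topic_props, dtype=float)
    lab = np.asarray(topic_labels)
    exact = [p[lab == k].sum() for k in range(5)]        # labels 0, 1, ..., 4
    at_least = [p[lab >= k].sum() for k in range(1, 6)]  # labels >=1, ..., >=5
    return exact + at_least                              # 10 feature values
```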

73

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

75

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

78

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• Predict which of the 5 error types an essay contains
  – recast as a multi-label classification task

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type

80

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain (sketch below)
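The talk says only that the 10K n-grams are chosen by information gain; here is a sketch with scikit-learn (an assumption), using mutual information as a stand-in for information gain over the seven discrete score classes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_ngram_features(lemmatized_essays, scores, k=10000):
    # lemmatized_essays: list of lemmatized essay strings;
    # scores: the discrete prompt adherence scores (1, 1.5, ..., 4).
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)
    # Keep the k n-grams most informative about the score labels.
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1]))
    X_selected = selector.fit_transform(X, scores)
    return X_selected, vectorizer, selector
```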

81

Summary of Features

• Baseline features
  – random indexing features
• Six types of new features
  – lemmatized n-grams
  – thesis clarity keyword features
  – prompt adherence keyword features
  – LDA topics
  – manually annotated LDA topics
  – predicted thesis clarity errors

82

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

93

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei (formulas below):
  – S1 (probability of error): the probability that a system predicts the wrong score out of 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from far misses
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3
  – error metrics
  – smaller value is better
• PC
  – correlation coefficient
  – larger value is better
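The metric formulas were images in the original slides and did not survive extraction; the following is a reconstruction consistent with the descriptions above, where N is the number of essays and E'_i denotes E_i rounded to the nearest of the seven scores (the rounding step is an assumption):

```latex
S_1 = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[A_i \neq E'_i\right], \qquad
S_2 = \frac{1}{N}\sum_{i=1}^{N} \left|A_i - E_i\right|, \qquad
S_3 = \frac{1}{N}\sum_{i=1}^{N} \left(A_i - E_i\right)^2,

\mathrm{PC} = \frac{\sum_{i}\,(A_i - \bar{A})(E_i - \bar{E})}
                   {\sqrt{\sum_{i}(A_i - \bar{A})^2}\;\sqrt{\sum_{i}(E_i - \bar{E})^2}}
```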

94

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (a tuning sketch follows)
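A sketch of this tuning step. The talk uses LIBSVM; scikit-learn's SVR wraps the same library, so it is used here for brevity. The grid of C values and the choice of S2 as the tuning objective are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVR

def tune_regressor(X_train, y_train, X_dev, y_dev):
    # Pick the regularization parameter C that minimizes average absolute
    # error (S2) on the held-out development data.
    best_c, best_s2, best_model = None, float("inf"), None
    for c in [0.01, 0.1, 1, 10, 100]:  # hypothetical grid
        model = SVR(kernel="linear", C=c).fit(X_train, y_train)
        s2 = float(np.mean(np.abs(model.predict(X_dev) - y_dev)))
        if s2 < best_s2:
            best_c, best_s2, best_model = c, s2, model
    return best_model, best_c
```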

103

Results

  System       S1     S2     S3     PC
  Baseline     .517   .368   .234   .233
  Our system   .488   .348   .197   .360

  (S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• Significant differences w.r.t. the baseline

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – train a regressor on all but one type of features

118

Feature Ablation Results

[Figure: for each scoring metric (S1, S2, S3, PC), the seven feature types — PA keywords, TC keywords, n-grams, annotated LDA topics, unlabeled LDA topics, RI, and predicted TC errors — ranked from most important to least important]

• The relative importance of features does not always remain consistent if we measure performance in different ways
• But... there are features that tend to be more important than the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 19: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

19

Plan for the Talk

Corpus and Annotations

Approach for scoring prompt adherence

bull Evaluation

19

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

• Motivation: the features introduced so far have trouble
  identifying topics that are related to, but not explicitly
  mentioned in, the prompt

  All armies should consist entirely of professional soldiers;
  there is no value in a system of military service.

• An essay containing words like "peace", "patriotism", or
  "training" is probably not digressing and should not be
  penalized for discussing these topics
  – But the various measures of keyword similarity might not
    notice that anything related to the prompt is discussed
  – this might have effects like lowering the RI similarity scores

57

How to create LDA features

For each prompt:
1. collect all the essays in the ICLE corpus written in response
   to it, not just those we labeled
2. build an LDA model of 1000 topics
   – a soft clustering of the words into 1000 sets
     • E.g., for the most frequent topic for the military prompt,
       the five most important words are
       "man", "military", "service", "pay", and "war"
   – the model can tell us how much an essay spends on each topic
     • E.g., 25% Topic 1, 45% Topic 2, 15% Topic 3, …, 5% Topic 1000
3. construct 1000 features, one for each topic
   • Feature value encodes how much of the essay was spent
     discussing the topic
   • E.g., an essay written for the military prompt might spend
     55% on {"fully", "count", "ordinary", "czech", "day"} and
     45% on {"man", "military", "service", "pay", "war"}

64
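A sketch of how such per-essay topic-proportion features could be derived with gensim; the 1000-topic setting comes from the slide, while the library choice, tokenization, and names are illustrative assumptions.

```python
from gensim import corpora, models

def lda_topic_features(tokenized_essays, num_topics=1000):
    """One topic-proportion feature vector per essay (sketch).

    tokenized_essays: list of token lists, one per essay written
    for the same prompt (labeled or not).
    """
    dictionary = corpora.Dictionary(tokenized_essays)
    bows = [dictionary.doc2bow(tokens) for tokens in tokenized_essays]
    lda = models.LdaModel(bows, id2word=dictionary, num_topics=num_topics)

    features = []
    for bow in bows:
        vec = [0.0] * num_topics
        # minimum_probability=0 so every topic's share is reported
        for topic_id, share in lda.get_document_topics(
                bow, minimum_probability=0.0):
            vec[topic_id] = float(share)
        features.append(vec)
    return features
```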

4 Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able
  to distinguish an infrequent topic that is adherent to the
  prompt from one that is an irrelevant digression
  – An infrequent topic may not appear enough in the training
    set for the regressor to make this judgment
• Create manually annotated LDA features

67

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an
   LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it
   with a number from 0 to 5 representing how likely it is that
   the topic is adherent to the prompt
   • higher score = more adherent
3. For each essay, create 10 features from the labeled topics

71

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of
  topics annotated with the number 0, 1, …, 4, respectively
• Five features encode the sum of contributions to an essay of
  topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively
• These features should give the regressor a better idea of how
  much of an essay is composed of prompt-related vs.
  prompt-unrelated discussions

73
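A minimal sketch of these ten aggregate features, assuming each essay's 100 topic shares and the hand-assigned 0-5 adherence labels are already available:

```python
def annotated_topic_features(topic_shares, topic_labels):
    """The 10 aggregate features from hand-annotated topics (sketch).

    topic_shares: 100 floats, the share of the essay assigned to
                  each LDA topic (sums to ~1)
    topic_labels: 100 ints in 0..5, the hand-annotated adherence
                  label of each topic
    """
    exact = [0.0] * 5     # sums for topics labeled exactly 0..4
    at_least = [0.0] * 5  # sums for topics labeled >= 1 .. >= 5
    for share, label in zip(topic_shares, topic_labels):
        if label <= 4:
            exact[label] += share
        for threshold in range(1, 6):
            if label >= threshold:
                at_least[threshold - 1] += share
    return exact + at_least
```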

5 Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring
  (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that
    detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

76

Features based on Error Types

• Introduced features for prompt adherence scoring that encode
  the error types an essay contains
  – Though each essay was manually annotated with the errors it
    contains, in a realistic setting we won't have access to
    these manual annotations
• Predict which of the 5 error types an essay contains
  – Recast as a multi-label classification task

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of
  each error type

80
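One common way to realize such a multi-label setup is one independent binary classifier per error type; the sketch below uses scikit-learn as an assumed stand-in, since the talk does not name the learner used for this step.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

def predicted_error_features(X_train, Y_train, X_test):
    """Binary predicted-error-type features (sketch).

    Y_train: (n_train x 5) binary indicator matrix of gold error
             types; the returned (n_test x 5) matrix supplies one
             presence/absence feature per error type.
    """
    # One independent binary classifier per error type
    classifier = OneVsRestClassifier(LinearSVC())
    classifier.fit(X_train, Y_train)
    return classifier.predict(X_test)
```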

6 N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain

81
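A sketch of such a selection step; scikit-learn's mutual information scorer is used here as a stand-in for information gain (the two are closely related), and all names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def top_ngram_features(lemmatized_essays, labels, k=10000):
    """Select the k most informative 1-3 grams (sketch).

    lemmatized_essays: list of whitespace-joined lemmatized essays
    labels: per-essay (discretized) prompt adherence scores
    """
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1]))
    X_selected = selector.fit_transform(X, labels)
    return X_selected, vectorizer, selector
```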

Summary of Features

• Baseline features
  – Random Indexing features
• Six types of new features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

82

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

84

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and
  the estimated scores Ei:
  – S1 (probability of error): the probability that the system
    predicts the wrong score out of the 7 possible scores
    (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from
    far misses
  – S3 (average squared error): prefers systems whose estimates
    are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3
  – error metrics
  – smaller value is better
• PC
  – correlation coefficient
  – larger value is better

94
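The metric formulas on these slides were images and did not survive extraction; under the standard definitions implied by the slide annotations, the three error metrics over N test essays can be written as:

```latex
S_1 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[ A_i \neq E_i \right],
\qquad
S_2 = \frac{1}{N} \sum_{i=1}^{N} \left| A_i - E_i \right|,
\qquad
S_3 = \frac{1}{N} \sum_{i=1}^{N} \left( A_i - E_i \right)^2
```

where A_i is the annotated score and E_i the estimated score of essay i.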

Regressor Training

• SVM regressors are trained to maximize performance w.r.t.
  each scoring metric by tuning the regularization parameter on
  held-out development data

95
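A sketch of this tuning loop; scikit-learn's SVR (which wraps LIBSVM) stands in for the talk's LIBSVM setup, the candidate C grid is an assumption, and for PC one would maximize the metric rather than minimize it.

```python
from sklearn.svm import SVR

def train_tuned_regressor(X_train, y_train, X_dev, y_dev, error_metric):
    """Tune the regularization parameter C on held-out data (sketch).

    error_metric: callable (y_true, y_pred) -> float for one of
                  S1, S2, S3 (lower is better).
    """
    best_model, best_error = None, float("inf")
    for C in (0.01, 0.1, 1.0, 10.0, 100.0):  # assumed grid
        model = SVR(kernel="linear", C=C).fit(X_train, y_train)
        error = error_metric(y_dev, model.predict(X_dev))
        if error < best_error:
            best_model, best_error = model, error
    return best_model
```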

Results

System         S1    S2    S3    PC
Baseline (RI)  .517  .368  .234  .233
Our system     .488  .348  .197  .360

• S1: probability of error; S2: average absolute error;
  S3: average squared error; PC: Pearson's correlation coefficient
• Improvements w.r.t. all four scoring metrics
• Significant differences

104

Feature Ablation

• Goal: examine how much impact each of the feature types has
  on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features

105
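The ablation protocol amounts to a leave-one-feature-type-out loop; a minimal sketch, with the training and evaluation routine left abstract:

```python
def feature_ablation(feature_groups, train_and_evaluate):
    """Leave-one-feature-type-out ablation (sketch).

    feature_groups: dict mapping a feature-type name to its
                    feature block
    train_and_evaluate: callable that trains a regressor on the
                        given blocks and returns its scores under
                        S1, S2, S3, and PC
    """
    results = {}
    for held_out in feature_groups:
        kept = {name: block for name, block in feature_groups.items()
                if name != held_out}
        results[held_out] = train_and_evaluate(kept)
    return results
```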

Feature Ablation Results

[Table: for each scoring metric (S1, S2, S3, PC), the seven
feature types (PA keywords, TC keywords, n-grams, RI, unlabeled
LDA topics, annotated LDA topics, predicted TC errors) ranked
from most important to least important when ablated]

• Relative importance of features does not always remain
  consistent if we measure performance in different ways
• But… there are features that tend to be more important than
  the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

118

Summary

• Examined the problem of prompt adherence scoring in student
  essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 20: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

20

bull recast problem as a linear regression task

bull train one regressor per prompt

Approach for Scoring Prompt Adherence

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 21: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

21

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

Approach for Scoring Prompt Adherence

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3. LDA Topics

• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt

"All armies should consist entirely of professional soldiers: there is no value in a system of military service."

• An essay containing words like "peace", "patriotism", or "training" is probably not a digression and should not be penalized for discussing these topics
  – But the various measures of keyword similarities might not notice that anything related to the prompt is discussed
  – this might have effects like lowering the RI similarity scores

How to create LDA features

For each prompt:
1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. build an LDA model of 1000 topics
   – Soft clustering of the words into 1000 sets
   – E.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"
   – Model can tell us how much an essay spends on each topic
   – E.g.: 25% Topic 1, 45% Topic 2, 15% Topic 3, ……, 5% Topic 1000
3. construct 1000 features, one for each topic
   – Feature value encodes how much of the essay was spent discussing the topic
   – E.g., an essay written for the military prompt might spend 55% on the topic {"fully", "count", "ordinary", "czech", "day"} and 45% on the topic {"man", "military", "service", "pay", "war"}
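A sketch of how such per-topic proportion features could be built; gensim is an assumed tool choice, since the talk does not name an LDA implementation:

from gensim import corpora, models

def lda_topic_features(tokenized_essays, num_topics=1000):
    dictionary = corpora.Dictionary(tokenized_essays)
    bows = [dictionary.doc2bow(doc) for doc in tokenized_essays]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    features = []
    for bow in bows:
        vec = [0.0] * num_topics
        # proportion of the essay spent on each topic
        for topic_id, prop in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = prop
        features.append(vec)
    return features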

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression
  – An infrequent topic may not appear enough in the training set for the regressor to make this judgment
• Create manually annotated LDA features

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
   • higher score = more adherent
3. For each essay, create 10 features from the labeled topics

10 Features from the Labeled Topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, respectively
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions
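A minimal sketch of these 10 feature values; topic_props (topic id → proportion of the essay) and topic_labels (topic id → 0–5 adherence annotation) are assumed representations:

def annotated_lda_features(topic_props, topic_labels):
    exact = [0.0] * 5     # contributions of topics labeled exactly 0, 1, ..., 4
    at_least = [0.0] * 5  # contributions of topics labeled >= 1, >= 2, ..., >= 5
    for topic_id, prop in topic_props.items():
        label = topic_labels[topic_id]
        if label <= 4:
            exact[label] += prop
        for k in range(1, 6):
            if label >= k:
                at_least[k - 1] += prop
    return exact + at_least  # 10 feature values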

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• Predict which of the 5 error types an essay contains
  – Recast as a multi-label classification task

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type (see the sketch below)
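One standard way to realize the multi-label recast is one binary classifier per error type; the sketch below is a hedged illustration (scikit-learn and LinearSVC are assumed choices, not necessarily the talk's classifier):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def predicted_error_features(X_train, Y_train, X_test):
    # Y_train: binary indicator matrix with one column per error type
    # (Confusing Phrasing, Missing Details, Writer Position,
    #  Incomplete Prompt Response, Relevance to Prompt)
    clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
    return clf.predict(X_test)  # 5 binary features per essay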

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain
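A sketch of the selection step; scikit-learn is an assumed tool choice, mutual information stands in for information gain, and the essays are assumed to be lemmatized beforehand:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ngram_features(lemmatized_essays, labels, k=10000):
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)  # uni-, bi-, and trigrams
    selector = SelectKBest(mutual_info_classif, k=k)  # keep the 10K most informative
    return selector.fit_transform(X, labels)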

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross validation

Scoring Metrics

• Define 4 evaluation metrics, computed over the annotated scores Ai and the estimated scores Ei:
  – S1 (probability of error): the probability that a system predicts the wrong score, out of 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from far misses
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3
  – error metrics
  – smaller value is better
• PC
  – correlation coefficient
  – larger value is better
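The metric formulas were rendered as images in the original slides; the LaTeX below is a reconstruction from the verbal definitions above, where Ai is the annotated score, Ei the estimated score, and averaging over the n test essays is an assumption:

S_1 = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[E_i \neq A_i], \quad
S_2 = \frac{1}{n}\sum_{i=1}^{n} |E_i - A_i|, \quad
S_3 = \frac{1}{n}\sum_{i=1}^{n} (E_i - A_i)^2, \quad
PC = \frac{\mathrm{cov}(A, E)}{\sigma_A \, \sigma_E}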

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data
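A sketch of that tuning loop, with scikit-learn's SVR standing in for the LIBSVM regressor (an assumed substitution) and an error metric where smaller is better:

from sklearn.svm import SVR

def tune_regressor(X_train, y_train, X_dev, y_dev, metric, c_grid=(0.1, 1, 10, 100)):
    best_c, best_err = None, None
    for c in c_grid:
        model = SVR(C=c).fit(X_train, y_train)
        err = metric(y_dev, model.predict(X_dev))  # e.g., average absolute error
        if best_err is None or err < best_err:
            best_c, best_err = c, err
    return SVR(C=best_c).fit(X_train, y_train)  # (for PC, prefer the larger value instead)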

Results

System       S1    S2    S3    PC
Baseline    .517  .368  .234  .233
Our system  .488  .348  .197  .360

(S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features (see the sketch below)
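A sketch of the ablation loop; feature_groups (feature type name → column indices) and train_eval (a helper running the 5-fold cross-validation described earlier) are assumed names:

import numpy as np

def ablation(X, y, feature_groups, train_eval):
    results = {}
    for name, cols in feature_groups.items():
        X_reduced = np.delete(X, cols, axis=1)  # drop this feature type's columns
        results[name] = train_eval(X_reduced, y)
    return results  # compare each entry against train_eval(X, y) on all features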

Feature Ablation Results

(each row orders the feature types from least important to most important when ablated; TC = thesis clarity, PA = prompt adherence)

S1: PA keywords, Annotated LDA topics, N-grams, Predicted TC errors, RI, Unlabeled LDA topics, TC keywords
S2: Predicted TC errors, Unlabeled LDA topics, PA keywords, RI, TC keywords, Annotated LDA topics, N-grams
S3: TC keywords, Predicted TC errors, PA keywords, Unlabeled LDA topics, RI, Annotated LDA topics, N-grams
PC: PA keywords, Unlabeled LDA topics, Predicted TC errors, RI, N-grams, TC keywords, Annotated LDA topics

• Relative importance of the features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – of middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 22: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

22

bull recast problem as a linear regression task

bull train one regressor per prompt

ndash common problems students have writing essays for one

prompt may not apply to essays written for another

bull one training instance per training essay

ndash ldquoclassrdquo value prompt adherence score

ndash learner LIBSVM

ndash features 7 feature types

Approach for Scoring Prompt Adherence

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 23: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

23

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3. LDA Topics

• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt

  All armies should consist entirely of professional soldiers: there is no value in a system of military service.

• An essay containing words like "peace", "patriotism", or "training" is probably not digressing and should not be penalized for discussing these topics
  – But the various measures of keyword similarity might not notice that anything related to the prompt is discussed
  – This might have effects like lowering the RI similarity scores

How to create LDA features

For each prompt:
1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. build an LDA model of 1000 topics
   – a soft clustering of the words into 1000 sets
   – e.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"
   – the model can tell us how much of an essay is spent on each topic, e.g., 25% Topic 1, 45% Topic 2, 15% Topic 3, ..., 5% Topic 1000
3. construct 1000 features, one for each topic
   – the feature value encodes how much of the essay was spent discussing the topic
   – e.g., an essay written for the military prompt might spend 55% on the topic ("fully", "count", "ordinary", "czech", "day") and 45% on the topic ("man", "military", "service", "pay", "war")
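One plausible way to realize these features, sketched with the gensim toolkit (the talk does not name an implementation; tokenization and variable names are assumptions):

from gensim import corpora, models

# tokenized_essays: list of token lists, one per ICLE essay for this prompt
dictionary = corpora.Dictionary(tokenized_essays)
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized_essays]
lda = models.LdaModel(bow_corpus, num_topics=1000, id2word=dictionary)

def lda_topic_features(essay_tokens):
    # 1000 values: the share of the essay devoted to each topic
    feats = [0.0] * 1000
    bow = dictionary.doc2bow(essay_tokens)
    for topic_id, share in lda.get_document_topics(bow, minimum_probability=0.0):
        feats[topic_id] = share
    return feats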

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression
  – An infrequent topic may not appear often enough in the training set for the regressor to make this judgment
• Create manually annotated LDA features

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
   • higher score = more adherent
3. For each essay, create 10 features from the labeled topics

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, ..., 4, respectively
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, ..., ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions
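A minimal sketch of these ten aggregates, assuming each essay is represented as a topic_share map (topic id to fraction of the essay) and label maps each topic id to its 0-5 annotation; the names are illustrative:

def annotated_lda_features(topic_share, label):
    exact = [0.0] * 5    # sums for topics labeled exactly 0, 1, ..., 4
    atleast = [0.0] * 5  # sums for topics labeled >= 1, >= 2, ..., >= 5
    for topic_id, share in topic_share.items():
        lab = label[topic_id]
        if lab <= 4:
            exact[lab] += share
        for k in range(1, 6):
            if lab >= k:
                atleast[k - 1] += share
    return exact + atleast  # 10 feature values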

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• Predict which of the 5 error types an essay contains
  – Recast as a multi-label classification task

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type
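One way to set up the multi-label prediction, sketched with scikit-learn's one-vs-rest wrapper as a stand-in for whatever learner the authors used; X_train, Y_train, and X_test are assumed placeholders:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

# Y_train is a binary indicator matrix with one column per error type.
clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
predicted = clf.predict(X_test)  # each row yields 5 binary "error type" features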

6. N-gram features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain
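A sketch of the selection step, under the assumption that information gain is computed against the discrete adherence scores; scikit-learn's mutual information scorer is used here as an approximation, and the variable names are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

vec = CountVectorizer(ngram_range=(1, 3), binary=True)
X = vec.fit_transform(lemmatized_essays)              # uni/bi/trigram features
selector = SelectKBest(mutual_info_classif, k=10000)  # keep top 10K n-grams
X_top = selector.fit_transform(X, adherence_scores)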

Summary of Features

• Baseline features
  – Random indexing features
• Six types of new features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross validation
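For concreteness, the protocol might be wired up as in this sketch (fold construction, feature_matrix, and evaluate are assumptions; the talk states only that 5-fold cross validation is used):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(feature_matrix):
    # train on four folds (reserving some essays as development data for
    # tuning), then score predictions on the held-out fold
    evaluate(train_idx, test_idx)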

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei:
  – S1 (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from far misses
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3 are error metrics: a smaller value is better
• PC is a correlation coefficient: a larger value is better
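Spelled out, one plausible formalization of the four metrics over N test essays, with annotated scores A_i and estimated scores E_i, is the following (the talk gives only verbal definitions; rounding E_i to the nearest of the seven scores in S1 is an assumption):

S_1 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big[ A_i \neq \mathrm{round}(E_i) \big]
S_2 = \frac{1}{N} \sum_{i=1}^{N} \lvert A_i - E_i \rvert
S_3 = \frac{1}{N} \sum_{i=1}^{N} ( A_i - E_i )^2
PC  = \frac{\sum_i (A_i - \bar{A})(E_i - \bar{E})}{\sqrt{\sum_i (A_i - \bar{A})^2} \, \sqrt{\sum_i (E_i - \bar{E})^2}}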

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data
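A sketch of that tuning loop, with scikit-learn's SVR standing in for whatever SVM implementation the authors used; the grid of C values and the function names are assumptions:

from sklearn.svm import SVR

def tune_regressor(X_train, y_train, X_dev, y_dev, metric, c_grid=(0.1, 1, 10, 100)):
    # Pick the regularization parameter C that optimizes the given metric on
    # the development data (for the error metrics S1-S3 smaller is better;
    # for PC the comparison would be flipped).
    best_score, best_reg = None, None
    for c in c_grid:
        reg = SVR(C=c).fit(X_train, y_train)
        score = metric(y_dev, reg.predict(X_dev))
        if best_score is None or score < best_score:
            best_score, best_reg = score, reg
    return best_reg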

Results

System       S1     S2     S3     PC
Baseline    .517   .368   .234   .233
Our system  .488   .348   .197   .360

(S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient; smaller is better for S1-S3, larger is better for PC)

• Improvements w.r.t. all four scoring metrics
• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features
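The ablation protocol amounts to the loop below (build_features, train_and_eval, and the feature-type names are illustrative placeholders, not the authors' code):

FEATURE_TYPES = ["RI", "n-grams", "TC keywords", "PA keywords",
                 "unlabeled LDA topics", "annotated LDA topics",
                 "predicted TC errors"]

def ablation(build_features, train_and_eval, metrics):
    results = {}
    for held_out in FEATURE_TYPES:
        kept = [f for f in FEATURE_TYPES if f != held_out]
        X = build_features(kept)                 # feature matrix without one type
        results[held_out] = {m: train_and_eval(X, m) for m in metrics}
    return results                               # bigger drop => more important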

Feature Ablation Results

Feature types ordered by importance when ablated, from most important to least important, for each scoring metric:

S1: TC keywords > unlabeled LDA topics > RI > predicted TC errors > n-grams > annotated LDA topics > PA keywords
S2: n-grams > annotated LDA topics > TC keywords > RI > PA keywords > unlabeled LDA topics > predicted TC errors
S3: n-grams > annotated LDA topics > RI > unlabeled LDA topics > PA keywords > predicted TC errors > TC keywords
PC: annotated LDA topics > TC keywords > n-grams > RI > predicted TC errors > unlabeled LDA topics > PA keywords

• The relative importance of features does not always remain consistent if we measure performance in different ways
• But there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 24: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

24

Features

bull 7 types of features

ndash baseline features

ndash 6 types of new features

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 25: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

25

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57–63

How to create LDA features

For each prompt:

1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled

2. build an LDA model of 1000 topics

– Soft clustering of the words into 1000 sets

• E.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"

– The model can tell us how much of an essay is spent on each topic

• E.g., 25% Topic 1, 45% Topic 2, 15% Topic 3, …, 5% Topic 1000

3. construct 1000 features, one for each topic

• Feature value encodes how much of the essay was spent discussing the topic

• E.g., an essay written for the military prompt might spend 55% on the topic {"fully", "count", "ordinary", "czech", "day"} and 45% on the topic {"man", "military", "service", "pay", "war"}
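
A sketch of this pipeline using gensim; the 1000-topic setting comes from the slides, while the tokenization and variable names are illustrative:

    from gensim import corpora, models

    def lda_topic_features(prompt_essays, num_topics=1000):
        # prompt_essays: list of token lists, all ICLE essays for one prompt
        dictionary = corpora.Dictionary(prompt_essays)
        bows = [dictionary.doc2bow(tokens) for tokens in prompt_essays]
        lda = models.LdaModel(bows, id2word=dictionary, num_topics=num_topics)
        features = []
        for bow in bows:
            vec = [0.0] * num_topics
            # minimum_probability=0.0 makes gensim return every topic
            for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
                vec[topic_id] = prob  # share of the essay spent on this topic
            features.append(vec)
        return lda, features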

64–66

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression

– An infrequent topic may not appear enough in the training set for the regressor to make this judgment

• Create manually annotated LDA features

67–70

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics

2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt

• higher score = more adherent

3. For each essay, create 10 features from the labeled topics

71–72

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, resp.

• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, resp.

• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions
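
A small sketch of these ten features, assuming topic_props is an essay's 100-dimensional topic-proportion vector and topic_labels holds the hand-assigned 0–5 adherence labels:

    import numpy as np

    def annotated_lda_features(topic_props, topic_labels):
        props = np.asarray(topic_props, dtype=float)   # shape (100,)
        labels = np.asarray(topic_labels)              # shape (100,), values 0..5
        # sum of contributions of topics labeled exactly 0, 1, ..., 4
        exact = [props[labels == k].sum() for k in range(5)]
        # sum of contributions of topics labeled >= 1, >= 2, ..., >= 5
        at_least = [props[labels >= k].sum() for k in range(1, 6)]
        return exact + at_least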

73

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we

– score an essay w.r.t. the clarity of its thesis

– determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74–75

5 Types of Thesis Clarity Errors

• Confusing Phrasing

• Missing Details

• Writer Position

• Incomplete Prompt Response

• Relevance to Prompt

76–78

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains

– Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations

• Predict which of the 5 error types an essay contains

– Recast as a multi-label classification task

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type
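
A minimal sketch of this multi-label recasting with scikit-learn; the one-vs-rest linear SVM and the tf-idf representation are illustrative choices, not necessarily the learner used in the original work:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
                   "Incomplete Prompt Response", "Relevance to Prompt"]

    def predicted_error_features(train_essays, train_labels, test_essays):
        # train_labels: binary indicator matrix of shape (n_train, 5),
        # one column per error type in ERROR_TYPES
        vec = TfidfVectorizer()
        clf = OneVsRestClassifier(LinearSVC())
        clf.fit(vec.fit_transform(train_essays), train_labels)
        # each row of the output is the 5 binary "error type" features
        return clf.predict(vec.transform(test_essays))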

80

6. N-gram features

• Can capture useful words and phrases related to a prompt

• 10K lemmatized unigrams, bigrams, and trigrams

– selected according to information gain
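
A sketch of the n-gram selection step; scikit-learn's mutual_info_classif is used here as a stand-in for information gain, and lemmatization is assumed to have happened upstream:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    def select_ngram_features(lemmatized_essays, scores, k=10000):
        # lemmatized_essays: each essay as a whitespace-joined lemma string
        # scores: prompt adherence scores (7 discrete values) used as labels
        vec = CountVectorizer(ngram_range=(1, 3), binary=True)
        X = vec.fit_transform(lemmatized_essays)
        selector = SelectKBest(mutual_info_classif, k=k)
        X_selected = selector.fit_transform(X, scores)
        return X_selected, vec, selector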

81

Summary of Features

• Baseline features

– Random indexing features

• Six types of features

– Lemmatized n-grams

– Thesis clarity keyword features

– Prompt adherence keyword features

– LDA topics

– Manually annotated LDA topics

– Predicted thesis clarity errors

82

Plan for the Talk

• Corpus and Annotations

• Model for scoring prompt adherence

• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring

• 5-fold cross-validation

84–93

Scoring Metrics

• Define 4 evaluation metrics over N test essays, where Ai is the annotated score and Ei the estimated score of essay i:

S1 = (1/N) Σi 1[Ei ≠ Ai] (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)

S2 = (1/N) Σi |Ai − Ei| (average absolute error): distinguishes near misses from far misses

S3 = (1/N) Σi (Ai − Ei)² (average squared error): prefers systems whose estimations are not too often far away from the correct scores

PC: Pearson's correlation coefficient between the Ai and the Ei

• S1, S2, S3: error metrics; smaller value is better

• PC: correlation coefficient; larger value is better
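
The four metrics are straightforward to compute; a sketch with NumPy (for S1, the estimated scores are assumed to be already rounded to the seven valid values):

    import numpy as np

    def scoring_metrics(annotated, estimated):
        a = np.asarray(annotated, dtype=float)  # Ai, gold scores
        e = np.asarray(estimated, dtype=float)  # Ei, predicted scores
        s1 = np.mean(a != e)               # probability of error
        s2 = np.mean(np.abs(a - e))        # average absolute error
        s3 = np.mean((a - e) ** 2)         # average squared error
        pc = np.corrcoef(a, e)[0, 1]       # Pearson's correlation coefficient
        return s1, s2, s3, pc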

94

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data
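
A sketch of this tuning loop; the slides do not name an implementation, so scikit-learn's SVR, the linear kernel, and the C grid below are all assumptions:

    from sklearn.svm import SVR

    def tune_svm_regressor(X_train, y_train, X_dev, y_dev, metric,
                           c_grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
        # metric(y_true, y_pred) must return a value where larger is better;
        # pass error metrics (S1, S2, S3) negated
        best = None
        for c in c_grid:
            model = SVR(kernel="linear", C=c).fit(X_train, y_train)
            val = metric(y_dev, model.predict(X_dev))
            if best is None or val > best[0]:
                best = (val, c, model)
        return best[2], best[1]  # tuned model and its C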

95–103

Results

System       S1    S2    S3    PC
Baseline     .517  .368  .234  .233
Our system   .488  .348  .197  .360

(Baseline = RI features; S1 = probability of error, S2 = average absolute error, S3 = average squared error, PC = Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics

• Significant differences

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric

– Train a regressor on all but one type of features

105–118

Feature Ablation Results

[Ablation table: for each metric (S1, S2, S3, PC), the seven feature types (TC keywords, PA keywords, N-grams, RI, unlabeled LDA topics, annotated LDA topics, predicted TC errors) ranked from most important to least important.]

• Relative importance of the feature types does not always remain consistent if we measure performance in different ways

• But… there are feature types that tend to be more important than the others in the presence of the other features

• most important: TC keywords, n-grams, annotated LDA topics

• middling importance: RI, unlabeled LDA topics

• least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays

– feature-rich approach

• Released the annotations

– prompt adherence scores

– prompt adherence keywords

– manually annotated LDA topics

Page 26: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

26

Baseline Features

bull Features based on Random Indexing (RI)

ndash adapted from Higgins et al (2004)

bull Random indexing

ndash ldquoan efficient and scalable alternative to LSIrdquo (Sahlgren

2005)

ndash generates a semantic similarity measure between any

two words

ndash generalized to computing similarity between two

groups of words (Higgins amp Burstein 2007)

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 27: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

27

Why Random Indexing (RI)

bull May help find text in essay related to the prompt

even if some of its words have been rephrased

ndash Eg essay talks about ldquojailrdquo while prompt has ldquoprisonrdquo

bull Train a RI model on the English Gigaword

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression.
  – An infrequent topic may not appear often enough in the training set for the regressor to make this judgment.
• Create manually annotated LDA features.

67

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics.
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt.
   • Higher score, more adherent.
3. For each essay, create 10 features from the labeled topics.

71

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, respectively.
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively.
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions.

73
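A sketch of the 10-feature computation (hypothetical helper; topic_dist maps each of the 100 topics to its share of the essay, and topic_labels maps each topic to its hand-assigned 0-5 adherence label):

def annotated_lda_features(topic_dist, topic_labels):
    feats = []
    # Five features: total contribution of topics labeled exactly 0, 1, ..., 4.
    for label in range(5):
        feats.append(sum(p for t, p in topic_dist.items()
                         if topic_labels[t] == label))
    # Five features: total contribution of topics labeled >= 1, >= 2, ..., >= 5.
    for threshold in range(1, 6):
        feats.append(sum(p for t, p in topic_dist.items()
                         if topic_labels[t] >= threshold))
    return feats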

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we:
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

76

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains.
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations.
• Predict which of the 5 error types an essay contains.
  – Recast as a multi-label classification task.

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type.

80
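One way to realize this step, assuming scikit-learn's one-vs-rest wrapper over five binary classifiers (the talk does not specify the learner used for error prediction):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

def predicted_error_type_features(X_train, Y_train, X_test):
    # Y_train: (n_essays, 5) binary matrix of gold error-type annotations.
    clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
    # Each column of the prediction is one binary "error type" feature.
    return clf.predict(X_test)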

6. N-gram Features

• Can capture useful words and phrases related to a prompt.
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain

81
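A sketch of the selection step, assuming scikit-learn, with mutual information standing in for the information-gain criterion (lemmatization is assumed to happen upstream):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ngram_features(lemmatized_essays, labels, k=10000):
    # Binary presence of lemmatized unigrams, bigrams, and trigrams.
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)
    # Keep the 10K n-grams that are most informative about the scores.
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1]))
    return selector.fit_transform(X, labels), vectorizer, selector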

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

82

Plan for the Talk

Corpus and Annotations
Model for scoring prompt adherence
Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

84

Scoring Metrics

• Define 4 evaluation metrics, comparing the annotated scores Ai with the estimated scores Ei:
  – S1 (probability of error): the probability that a system predicts the wrong score, out of 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from far misses
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better

94
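In symbols, a plausible formalization of the four metrics consistent with the descriptions above, with $A_i$ the annotated and $E_i$ the estimated score of essay $i$ among $N$ essays (for $S_1$, the estimate is taken as the nearest of the 7 possible scores):

\[
S_1 = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\,[A_i \neq E_i], \qquad
S_2 = \frac{1}{N}\sum_{i=1}^{N} \lvert A_i - E_i \rvert, \qquad
S_3 = \frac{1}{N}\sum_{i=1}^{N} (A_i - E_i)^2,
\]
\[
PC = \frac{\sum_{i}(A_i - \bar{A})(E_i - \bar{E})}
          {\sqrt{\sum_{i}(A_i - \bar{A})^2}\,\sqrt{\sum_{i}(E_i - \bar{E})^2}}.
\]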

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data.

95
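A sketch of the tuning loop, assuming scikit-learn's SVR (the talk says SVM regression but names no toolkit; the C grid is illustrative):

from sklearn.svm import SVR

def train_tuned_svr(X_train, y_train, X_dev, y_dev, error_metric):
    # error_metric(gold, predicted) -> value where smaller is better;
    # when tuning for PC, pass the negated correlation.
    best_score, best_model = None, None
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
        model = SVR(kernel="linear", C=C).fit(X_train, y_train)
        score = error_metric(y_dev, model.predict(X_dev))
        if best_score is None or score < best_score:
            best_score, best_model = score, model
    return best_model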

Results

System        S1      S2      S3      PC
Baseline      .517    .368    .234    .233
Our system    .488    .348    .197    .360

(Baseline = the RI features; S1 = probability of error, S2 = average absolute error, S3 = average squared error, PC = Pearson's correlation coefficient. Smaller is better for S1–S3; larger is better for PC.)

• Improvements w.r.t. all four scoring metrics
• Significant differences

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric.
  – Train a regressor on all but one type of features (sketched below).

105
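The ablation protocol as a short sketch (feature_groups is a hypothetical mapping from each feature type to its column indices; train_eval runs one cross-validated train-and-score pass):

def ablate(X, y, feature_groups, train_eval):
    results = {}
    for name, cols in feature_groups.items():
        drop = set(cols)
        keep = [j for j in range(X.shape[1]) if j not in drop]
        # Train the regressor on all but this one type of features.
        results[name] = train_eval(X[:, keep], y)
    return results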

Feature Ablation Results

Each row ranks the seven feature types for one metric, from least important (left) to most important (right):

S1: PA keywords | Annotated LDA topics | N-grams | Predicted TC errors | RI | Unlabeled LDA topics | TC keywords
S2: Predicted TC errors | Unlabeled LDA topics | PA keywords | RI | TC keywords | Annotated LDA topics | N-grams
S3: TC keywords | Predicted TC errors | PA keywords | Unlabeled LDA topics | RI | Annotated LDA topics | N-grams
PC: PA keywords | Unlabeled LDA topics | Predicted TC errors | RI | N-grams | TC keywords | Annotated LDA topics

• The relative importance of features does not always remain consistent if we measure performance in different ways.
• But… there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 28: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

28

5 Random Indexing Features

bull The entire essayrsquos similarity to the prompt

bull The essayrsquos highest individual sentencersquos similarity to

the prompt

bull The highest entire essay similarity to one of the

prompt sentences

bull The highest individual sentence similarity to an

individual prompt sentence

bull The essayrsquos similarity to a manually rewritten version

of the prompt that excludes extraneous material

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 29: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

29

Features

bull 7 types of features

ndash baseline features

ndash 6 types of features

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type.
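A minimal sketch of the multi-label recast using scikit-learn's one-vs-rest wrapper (an assumption; the talk does not say how the classifiers are trained). `X_train`, `Y_train`, and `X_test` are illustrative names; `Y_train` is an essays × 5 binary matrix over the error types:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# One binary classifier per error type, trained on the manually
# annotated essays.
error_clf = OneVsRestClassifier(LinearSVC())
error_clf.fit(X_train, Y_train)

# Each row holds the 5 binary "error type" features for one test essay.
predicted_error_features = error_clf.predict(X_test)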

6 N-gram features

• Can capture useful words and phrases related to a prompt.
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain
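A sketch of this feature type, assuming scikit-learn and pre-lemmatized essay strings; mutual information stands in for information gain here, which is an approximation rather than the authors' exact selection criterion:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Lemmatized unigrams, bigrams, and trigrams.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X_ngrams = vectorizer.fit_transform(lemmatized_essays)

# Keep the 10K n-grams that are most informative about the essay scores.
selector = SelectKBest(mutual_info_classif, k=10000)
X_selected = selector.fit_transform(X_ngrams, essay_scores)
```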

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

Scoring Metrics

• Define 4 evaluation metrics, where Ai is the annotated score and Ei is the estimated score of the i-th of N essays:
  – S1 (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4):
      S1 = (1/N) Σi 1[Ai ≠ Ei]
  – S2 (average absolute error): distinguishes near misses from far misses:
      S2 = (1/N) Σi |Ai − Ei|
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores:
      S3 = (1/N) Σi (Ai − Ei)²
  – PC: Pearson's correlation coefficient between the Ai and the Ei
• S1, S2, S3 are error metrics: a smaller value is better.
• PC is a correlation coefficient: a larger value is better.
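The four metrics are straightforward to compute; a sketch with numpy/scipy, where `A` holds the annotated scores and `E` the system's estimates (snapped to the 7 valid scores when computing S1):

```python
import numpy as np
from scipy.stats import pearsonr

def scoring_metrics(A, E):
    S1 = np.mean(A != E)           # probability of error
    S2 = np.mean(np.abs(A - E))    # average absolute error
    S3 = np.mean((A - E) ** 2)     # average squared error
    PC = pearsonr(A, E)[0]         # Pearson's correlation coefficient
    return S1, S2, S3, PC
```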

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data.
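A rough sketch of this regime (the talk does not name an SVM package; scikit-learn's SVR is used here for illustration, reusing `scoring_metrics` from the sketch above). Predictions are snapped to the nearest half-point score before computing S1:

```python
import numpy as np
from sklearn.svm import SVR

def snap(scores):
    # Round continuous predictions to the 7 valid scores 1, 1.5, ..., 4.
    return np.clip(np.round(scores * 2) / 2, 1.0, 4.0)

# Tune the regularization parameter C for one scoring metric (here S1)
# on held-out development data; repeat per metric.
best_C, best_err = None, float("inf")
for C in [0.01, 0.1, 1, 10, 100]:
    reg = SVR(C=C).fit(X_train, y_train)
    err = scoring_metrics(y_dev, snap(reg.predict(X_dev)))[0]
    if err < best_err:
        best_C, best_err = C, err
```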

Results

System                   S1     S2     S3     PC
Baseline (RI features)   .517   .368   .234   .233
Our system               .488   .348   .197   .360

(S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• Significant differences over the baseline

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric.
  – Train a regressor on all but one type of features (see the sketch below).
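A sketch of the ablation loop, reusing the helpers above; `feature_groups` is an illustrative name for a map from each feature-type name to its column indices in a dense feature matrix:

```python
import numpy as np
from sklearn.svm import SVR

# Drop one feature type at a time, retrain, and score.
for held_out, cols in feature_groups.items():
    keep = np.setdiff1d(np.arange(X_train.shape[1]), cols)
    reg = SVR(C=best_C).fit(X_train[:, keep], y_train)
    S1, S2, S3, PC = scoring_metrics(y_test, snap(reg.predict(X_test[:, keep])))
    print(f"without {held_out}: S1={S1:.3f} S2={S2:.3f} S3={S3:.3f} PC={PC:.3f}")
```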

Feature Ablation Results

[Table: for each scoring metric (S1, S2, S3, PC), the seven feature types — TC keywords, PA keywords, n-grams, RI, unlabeled LDA topics, annotated LDA topics, predicted TC errors — ranked from most important to least important.]

• The relative importance of the features does not always remain consistent if we measure performance in different ways.
• But there are features that tend to be more important than the others in the presence of the other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 30: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

30

1 Thesis Clarity Keyword Features

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 31: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

31

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

System        S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(S1 = probability of error, S2 = average absolute error, S3 = average squared error, PC = Pearson's correlation coefficient; the baseline uses the RI features)

• Improvements w.r.t. all four scoring metrics
• Significant differences
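
For concreteness, a small helper (my sketch, not from the talk) that computes the four numbers reported above from vectors of annotated and estimated scores; snapping the estimates to the 7-point scale before computing S1 is an assumption:

```python
import numpy as np

def score_metrics(annotated, estimated):
    a = np.asarray(annotated, dtype=float)
    e = np.asarray(estimated, dtype=float)
    # snap estimates to the nearest of the 7 scores 1, 1.5, ..., 4 for S1
    e_rounded = np.clip(np.round(e * 2) / 2, 1.0, 4.0)
    s1 = np.mean(a != e_rounded)      # probability of error
    s2 = np.mean(np.abs(a - e))       # average absolute error
    s3 = np.mean((a - e) ** 2)        # average squared error
    pc = np.corrcoef(a, e)[0, 1]      # Pearson's correlation coefficient
    return s1, s2, s3, pc
```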

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features (see the sketch below)
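
A minimal sketch of that leave-one-type-out loop, assuming each feature type is stored as a separate matrix and fixing C for brevity:

```python
import numpy as np
from sklearn.svm import SVR

def ablation(train_blocks, y_train, dev_blocks, y_dev, metric):
    """train_blocks / dev_blocks map a feature-type name (e.g. "TC keywords",
    "n-grams") to its feature matrix; returns the error with each type removed."""
    errors = {}
    for left_out in train_blocks:
        keep = [name for name in sorted(train_blocks) if name != left_out]
        X_train = np.hstack([train_blocks[n] for n in keep])
        X_dev = np.hstack([dev_blocks[n] for n in keep])
        model = SVR(kernel="linear", C=1.0).fit(X_train, y_train)
        errors[left_out] = metric(y_dev, model.predict(X_dev))
    # the larger the error after removing a type, the more important it was
    return errors
```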

Feature Ablation Results

Feature types ranked from most to least important for each scoring metric:

S1: TC keywords > unlabeled LDA topics > RI > predicted TC errors > n-grams > annotated LDA topics > PA keywords
S2: n-grams > annotated LDA topics > TC keywords > RI > PA keywords > unlabeled LDA topics > predicted TC errors
S3: n-grams > annotated LDA topics > RI > unlabeled LDA topics > PA keywords > predicted TC errors > TC keywords
PC: annotated LDA topics > TC keywords > n-grams > RI > predicted TC errors > unlabeled LDA topics > PA keywords

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach

• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 32: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

32

bull introduced in Persing amp Ng (2013) for scoring the

thesis clarity of an essay

bull generated based on thesis clarity keywords

1 Thesis Clarity Keyword Features

refers to how clearly an author explains the thesis of her essay

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 33: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

33

What are Thesis Clarity Keywords

34

What are Thesis Clarity Keywords

bull ldquoimportantrdquo words in a prompt

ndash important word good word for a student to use when

stating her thesis about the prompt

35

How to Identify Keywords

36

How to Identify Keywords

bull By hand

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

Feature Ablation Results

[Figure: for each scoring metric (S1, S2, S3, PC), the seven feature types — TC keywords, PA keywords, N-grams, RI, unlabeled LDA topics, annotated LDA topics, predicted TC errors — arranged from most important to least important]

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling important: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords
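One way to recover per-metric orderings like those in the figure is to rank feature types by how much each metric degrades when the type is removed. A sketch, under the assumption that results comes from the ablation sketch above and full holds the full system's (S1, S2, S3, PC):

    METRICS = ["S1", "S2", "S3", "PC"]

    def importance_orderings(results, full):
        """Feature types from most to least important under each metric."""
        out = {}
        for m, name in enumerate(METRICS):
            sign = -1.0 if name == "PC" else 1.0  # PC: larger is better
            delta = {t: sign * (scores[m] - full[m])
                     for t, scores in results.items()}
            out[name] = sorted(delta, key=delta.get, reverse=True)
        return out

Under this reading, a large positive delta means the system got noticeably worse without that feature type, which matches the bullets above: TC keywords, n-grams, and annotated LDA topics hurt most when removed.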

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics


ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 37: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

37

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 38: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

38

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3. LDA Topics

• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt

All armies should consist entirely of professional soldiers; there is no value in a system of military service.

• An essay containing words like "peace", "patriotism", or "training" is probably not a digression and should not be penalized for discussing these topics
  – but the various measures of keyword similarity might not notice that anything related to the prompt is discussed
  – this might have effects like lowering the RI similarity scores

57

How to create LDA features

For each prompt:
1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. build an LDA model of 1000 topics
   – a soft clustering of the words into 1000 sets
   – e.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"
   – the model can tell us how much of an essay is spent on each topic, e.g., 25% on Topic 1, 45% on Topic 2, 15% on Topic 3, …, 5% on Topic 1000
3. construct 1000 features, one for each topic (sketched below)
   – the feature value encodes how much of the essay was spent discussing the topic
   – e.g., an essay written for the military prompt might spend 55% on the topic ("fully", "count", "ordinary", "czech", "day") and 45% on the topic ("man", "military", "service", "pay", "war")

64
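A minimal sketch of this feature construction, assuming gensim (the talk does not name a toolkit): train one 1000-topic model per prompt, then use each essay's topic proportions as feature values.

```python
# Sketch: per-prompt LDA topic features (gensim is an assumption).
from gensim import corpora, models

def lda_topic_features(tokenized_essays, num_topics=1000):
    """tokenized_essays: all ICLE essays for one prompt, as token lists."""
    dictionary = corpora.Dictionary(tokenized_essays)
    bows = [dictionary.doc2bow(toks) for toks in tokenized_essays]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    features = []
    for bow in bows:
        # Proportion of the essay spent on each of the 1000 topics
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        features.append([dist.get(t, 0.0) for t in range(num_topics)])
    return lda, features
```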

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression
  – an infrequent topic may not appear enough in the training set for the regressor to make this judgment
• Create manually annotated LDA features

67

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt (see the sketch below)
   – higher score = more adherent
3. For each essay, create 10 features from the labeled topics

71
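Continuing the gensim assumption above, step 2's annotation pass can be prepared by dumping each topic's top 10 words for the annotator to score from 0 to 5:

```python
# Sketch: list each topic's top 10 words for manual 0-5 adherence labeling.
def dump_topics_for_annotation(lda, num_topics=100, topn=10):
    for t in range(num_topics):
        words = [w for w, _ in lda.show_topic(t, topn=topn)]
        print(t, " ".join(words))  # annotator records a 0-5 score per topic
```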

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, respectively
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions

73
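A minimal sketch of how these 10 aggregate features can be computed from the hand-labeled topics:

```python
# Sketch of the 10 features built from hand-labeled topics.
# topic_share: topic id -> fraction of the essay spent on that topic
# topic_label: topic id -> hand-assigned adherence score in {0, ..., 5}

def annotated_lda_features(topic_share, topic_label):
    exact = [0.0] * 5     # contributions of topics labeled exactly 0..4
    at_least = [0.0] * 5  # contributions of topics labeled >=1 .. >=5
    for topic, share in topic_share.items():
        label = topic_label[topic]
        if label <= 4:
            exact[label] += share
        for k in range(1, 6):
            if label >= k:
                at_least[k - 1] += share
    return exact + at_least  # 10 feature values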

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

76

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• Predict which of the 5 error types an essay contains
  – Recast as a multi-label classification task

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type (a sketch follows)

80
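A sketch of the multi-label prediction step, assuming scikit-learn with one linear SVM per error type (the classifier choice is an assumption, not the talk's stated method):

```python
# Sketch: predict the 5 thesis clarity error types (multi-label), then
# use the 5 binary predictions as features for prompt adherence scoring.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

def predicted_error_features(X_train, Y_train, X_test):
    """Y_train: (n_essays, 5) binary matrix of annotated error types."""
    clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
    return clf.predict(X_test)  # one binary feature per error type
```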

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain (sketched below)

81
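A sketch of the n-gram feature selection, approximating information gain with scikit-learn's mutual information (an assumption; lemmatization is assumed to happen upstream):

```python
# Sketch: select 10K lemmatized uni/bi/trigram features by information gain.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ngram_features(lemmatized_essays, scores, k=10000):
    vec = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vec.fit_transform(lemmatized_essays)
    y = (2 * np.asarray(scores)).astype(int)  # 7 half-point scores -> classes
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    return selector.transform(X), vec, selector
```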

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

82

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross validation (see the sketch below)

84
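A minimal sketch of the evaluation loop; train_eval is a hypothetical routine that trains on one fold split and returns the metric scores:

```python
# Sketch: 5-fold cross validation over the annotated essays.
from sklearn.model_selection import KFold

def cross_validate(X, y, train_eval, n_splits=5):
    scores = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True).split(X):
        scores.append(train_eval(X[tr], y[tr], X[te], y[te]))
    return scores
```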

Scoring Metrics

• Define 4 evaluation metrics, where Ai is the annotated score and Ei is the estimated score of essay i (N essays in total):
  – S1 = (1/N) Σi 1[Ai ≠ Ei]  (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 = (1/N) Σi |Ai − Ei|  (average absolute error): distinguishes near misses from far misses
  – S3 = (1/N) Σi (Ai − Ei)²  (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between the Ai and the Ei
• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better

94
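The four metrics translate directly into code (for S1, the estimated scores are assumed to be rounded to the 7 permissible values first):

```python
# The four scoring metrics; A = annotated scores, E = estimated scores.
import numpy as np
from scipy.stats import pearsonr

def scoring_metrics(A, E):
    A, E = np.asarray(A, float), np.asarray(E, float)
    s1 = np.mean(A != E)         # probability of error
    s2 = np.mean(np.abs(A - E))  # average absolute error
    s3 = np.mean((A - E) ** 2)   # average squared error
    pc = pearsonr(A, E)[0]       # Pearson's correlation coefficient
    return s1, s2, s3, pc
```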

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (a sketch follows)

95
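A minimal sketch of the tuning loop, assuming scikit-learn's SVR and a small grid over the regularization parameter C (the grid values are an assumption):

```python
# Sketch: tune an SVM regressor's C on held-out development data,
# once per scoring metric (negate PC so that smaller is better).
from sklearn.svm import SVR

def train_tuned_svr(X_tr, y_tr, X_dev, y_dev, metric, Cs=(0.1, 1, 10, 100)):
    """metric(gold, predicted) -> score, where smaller is better."""
    best, best_score = None, float("inf")
    for C in Cs:
        model = SVR(kernel="linear", C=C).fit(X_tr, y_tr)
        score = metric(y_dev, model.predict(X_dev))
        if score < best_score:
            best, best_score = model, score
    return best
```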

Results

System       S1    S2    S3    PC
Baseline    .517  .368  .234  .233
Our system  .488  .348  .197  .360

• Baseline: RI features
• S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient
• Our system improves w.r.t. all four scoring metrics
• Significant differences between Baseline and Our system

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features (a sketch of the loop follows)

105
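A sketch of the ablation loop; feature_groups maps a feature type's name to its column indices, and train_eval_cv is the hypothetical train-and-score routine from the evaluation sketch:

```python
# Sketch: retrain leaving out one feature group at a time.
def ablation(X, y, feature_groups, train_eval_cv):
    """train_eval_cv(X, y) -> dict of metric scores (S1, S2, S3, PC)."""
    results = {"all features": train_eval_cv(X, y)}
    for name, cols in feature_groups.items():
        keep = [j for j in range(X.shape[1]) if j not in set(cols)]
        results["all - " + name] = train_eval_cv(X[:, keep], y)
    return results
```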

Feature Ablation Results

Feature types ranked per scoring metric, from least important (left) to most important (right) when ablated:

S1: PA keywords, Annotated LDA topics, N-grams, Predicted TC errors, RI, Unlabeled LDA topics, TC keywords
S2: Predicted TC errors, Unlabeled LDA topics, PA keywords, RI, TC keywords, Annotated LDA topics, N-grams
S3: TC keywords, Predicted TC errors, PA keywords, Unlabeled LDA topics, RI, Annotated LDA topics, N-grams
PC: PA keywords, Unlabeled LDA topics, Predicted TC errors, RI, N-grams, TC keywords, Annotated LDA topics

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 39: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

39

Hand-Selecting Keywords

bull Hand-segment each multi-part prompt into parts

bull For each part hand-pick the most important (primary)

and second most important (secondary) words that it

would be good for a writer to use to address the part

The prison system is outdated No civilized society

should punish its criminals it should rehabilitate them

Primary rehabilitate

Secondary society

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 40: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

40

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

41

Designing Keyword Features

bull Example in one feature we

1 compute the random indexing similarity between

the essay and each group of primary keywords

taken from parts of the essayrsquos prompt

2 assign the feature the lowest of these values

bull A low feature value suggests that the student ignored

the prompt component from which the value came

42

Thesis Clarity Keyword Features

bull Though these features were designed for scoring thesis

clarity some of them are useful for prompt adherence

scoring

43

2 Prompt Adherence Keyword Features

bull Motivation rather than relying on keywords for thesis

clarity why not hand-pick keywords for prompt

adherence and create features from them

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• Predict which of the 5 error types an essay contains
  – Recast as a multi-label classification task

Creating Predicted “Error Type” Features

• Add a binary feature indicating the presence or absence of each error type (sketched below)
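A hedged sketch of the multi-label recasting: one binary classifier per error type, whose predictions become five binary features. The random arrays stand in for real essay feature vectors and gold error annotations.

```python
# Sketch: error-type prediction as multi-label classification.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

rng = np.random.default_rng(0)
X_train = rng.random((40, 100))                   # essay feature vectors
Y_train = rng.integers(0, 2, size=(40, 5))        # gold error-type labels
X_test = rng.random((10, 100))

clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
error_type_features = clf.predict(X_test)         # one 0/1 feature per type
```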

6. N-gram features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain (see the sketch below)
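A hedged sketch of the selection step: extract uni-/bi-/trigrams and keep the 10K highest-scoring ones. sklearn's `mutual_info_classif` stands in for information gain here, lemmatization is assumed to have been done already, and the tiny example data is illustrative.

```python
# Sketch: top-10K n-gram features by (approximate) information gain.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

lemmatized_essays = ["army consist professional soldier",
                     "value military service"]
labels = [3.0, 2.5]   # prompt adherence scores used as selection targets

vec = CountVectorizer(ngram_range=(1, 3))
X = vec.fit_transform(lemmatized_essays)
selector = SelectKBest(mutual_info_classif, k=min(10000, X.shape[1]))
X_top = selector.fit_transform(X, labels)
```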

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross validation

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei (reconstructed formulas below):
  – S1: probability that a system predicts the wrong score out of 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2: average absolute error; distinguishes near misses from far misses
  – S3: average squared error; prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better
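The metric formulas on the original slides were embedded as images; the following is a plausible LaTeX reconstruction from the verbal definitions above, with A_j the annotated score and E_j the estimated score of essay j, and N the number of essays (for S1, the estimated score is assumed to be rounded to the nearest of the 7 possible scores before comparison):

```latex
S1 = \frac{1}{N}\sum_{j=1}^{N} \mathbf{1}\!\left[A_j \neq E_j\right], \qquad
S2 = \frac{1}{N}\sum_{j=1}^{N} \left|A_j - E_j\right|, \qquad
S3 = \frac{1}{N}\sum_{j=1}^{N} \left(A_j - E_j\right)^2
```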

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (see the sketch below)
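A hedged sketch of the tuning loop: the talk's SVM regressor is stood in for by sklearn's `SVR`, random arrays replace the real essay features, and the regularization parameter C is chosen on development data by the metric being optimized (S2, average absolute error, shown here).

```python
# Sketch: pick C by performance on held-out development data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train, y_train = rng.random((40, 10)), rng.uniform(1, 4, 40)
X_dev, y_dev = rng.random((10, 10)), rng.uniform(1, 4, 10)

def s2(annotated, estimated):
    # Average absolute error (metric S2).
    return float(np.mean(np.abs(annotated - estimated)))

best_c = min(
    (10.0 ** k for k in range(-3, 4)),
    key=lambda c: s2(y_dev,
                     SVR(kernel="linear", C=c).fit(X_train, y_train)
                                              .predict(X_dev)),
)
```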

Results

System        S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(S1 = probability of error; S2 = average absolute error; S3 = average squared error; PC = Pearson's correlation coefficient. The baseline uses the RI features.)

• Improvements w.r.t. all four scoring metrics
• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features (sketched below)
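A hedged sketch of the ablation protocol: drop one feature group at a time, retrain, and re-score. `feature_groups` (name to column indices) and the data arrays are illustrative stand-ins.

```python
# Sketch: leave-one-feature-group-out ablation, scored by S2.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X_train, y_train = rng.random((40, 30)), rng.uniform(1, 4, 40)
X_dev, y_dev = rng.random((10, 30)), rng.uniform(1, 4, 10)
feature_groups = {"PA keywords": range(0, 10),
                  "TC keywords": range(10, 20),
                  "n-grams": range(20, 30)}

for name, cols in feature_groups.items():
    keep = [i for i in range(X_train.shape[1]) if i not in set(cols)]
    model = SVR(kernel="linear").fit(X_train[:, keep], y_train)
    err = np.mean(np.abs(y_dev - model.predict(X_dev[:, keep])))
    print(f"without {name}: S2 = {err:.3f}")
```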

Feature Ablation Results

[Table: for each scoring metric (S1, S2, S3, PC), the seven feature types – TC keywords, PA keywords, n-grams, annotated LDA topics, unlabeled LDA topics, predicted TC errors, and RI – ranked from most important to least important]

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics


Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 43: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

43

2. Prompt Adherence Keyword Features

• Motivation: rather than relying on keywords for thesis clarity, why not hand-pick keywords for prompt adherence and create features from them?

45

An Illustrative Example

Marx once said that religion was the opium of the masses. If he was alive at the end of the 20th century, he would replace religion with television.

• This question suggests that students discuss whether television is analogous to religion in this way
– prompt adherence keywords contain “religion”
– thesis clarity keywords do not contain “religion”
– a thesis like “Television is bad” can be stated clearly without reference to “religion”
• an essay with this thesis could have a high thesis clarity score
– but a low prompt adherence score: religion should be discussed

50

Creating Features from Keywords

• Two types of features; for each prompt component:
1. take the RI similarity between the whole essay and the component’s keywords
2. compute the fraction of the component’s keywords that appear in the essay

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

For each prompt:
1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. build an LDA model of 1,000 topics
– a soft clustering of the words into 1,000 sets
– e.g., for the most frequent topic for the military prompt, the five most important words are “man”, “military”, “service”, “pay”, and “war”
– the model can tell us how much an essay spends on each topic, e.g. 25% on Topic 1, 45% on Topic 2, 15% on Topic 3, …, 5% on Topic 1000
3. construct 1,000 features, one for each topic
– the feature value encodes how much of the essay was spent discussing the topic
– e.g., an essay written for the military prompt might spend 55% on [“fully”, “count”, “ordinary”, “czech”, “day”] and 45% on [“man”, “military”, “service”, “pay”, “war”]
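A minimal sketch of this feature construction, using gensim as the LDA implementation (an assumption; the talk does not name a toolkit):

from gensim import corpora, models

def lda_topic_features(prompt_essays, essay_tokens, num_topics=1000):
    # prompt_essays: every tokenized ICLE essay written for this prompt
    dictionary = corpora.Dictionary(prompt_essays)
    corpus = [dictionary.doc2bow(doc) for doc in prompt_essays]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # one feature per topic: the share of the essay spent on that topic
    feats = [0.0] * num_topics
    bow = dictionary.doc2bow(essay_tokens)
    for topic_id, share in lda.get_document_topics(bow, minimum_probability=0.0):
        feats[topic_id] = float(share)
    return feats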

64

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish an infrequent topic that is adherent to the prompt from one that is an irrelevant digression
– an infrequent topic may not appear often enough in the training set for the regressor to make this judgment
• Create manually annotated LDA features

67

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
• higher score = more adherent
3. For each essay, create 10 features from the labeled topics

71

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, respectively
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussion
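A minimal sketch of these ten features, assuming a per-essay topic distribution and the hand-assigned 0–5 topic labels as inputs (both hypothetical data structures):

def annotated_lda_features(topic_share, topic_label):
    # topic_share: topic id -> fraction of the essay spent on the topic
    # topic_label: topic id -> hand-assigned adherence label in 0..5
    feats = []
    for k in range(5):      # contributions of topics labeled exactly 0, 1, ..., 4
        feats.append(sum(s for t, s in topic_share.items() if topic_label[t] == k))
    for k in range(1, 6):   # contributions of topics labeled >= 1, >= 2, ..., >= 5
        feats.append(sum(s for t, s in topic_share.items() if topic_label[t] >= k))
    return feats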

73

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
– score an essay w.r.t. the clarity of its thesis
– determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

76

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
– though each essay was manually annotated with the errors it contains, in a realistic setting we won’t have access to these manual annotations
• Predict which of the 5 error types an essay contains
– recast as a multi-label classification task

79

Creating Predicted “Error Type” Features

• Add a binary feature indicating the presence or absence of each error type
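A minimal sketch of this step, casting error-type prediction as multi-label classification with scikit-learn's one-vs-rest wrapper (an assumption; the talk does not specify the classifier):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

def predicted_error_type_features(X_train, Y_train_errors, X_new):
    # Y_train_errors: an (n_essays x 5) 0/1 indicator matrix of gold error types
    clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train_errors)
    # one binary presence/absence feature per error type for each new essay
    return clf.predict(X_new)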

80

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
– selected according to information gain
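A minimal sketch of the n-gram selection, using scikit-learn with mutual information standing in for the information-gain criterion (toolkit and variable names are assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_ngrams(lemmatized_essays, scores, k=10000):
    # unigrams, bigrams, and trigrams over the lemmatized essay text
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vectorizer.fit_transform(lemmatized_essays)
    # keep the k n-grams most informative about the adherence scores
    selector = SelectKBest(mutual_info_classif, k=k)
    X_selected = selector.fit_transform(X, scores)
    return X_selected, vectorizer, selector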

81

Summary of Features

• Baseline features
– random indexing features
• Six types of features
– lemmatized n-grams
– thesis clarity keyword features
– prompt adherence keyword features
– LDA topics
– manually annotated LDA topics
– predicted thesis clarity errors

82

Plan for the Talk

Corpus and Annotations
Model for scoring prompt adherence
Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

84

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei:
– S1 (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4): S1 = (1/N) Σ 1[Ei ≠ Ai]
– S2 (average absolute error): distinguishes near misses from far misses: S2 = (1/N) Σ |Ei − Ai|
– S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores: S3 = (1/N) Σ (Ei − Ai)²
– PC: Pearson’s correlation coefficient between Ai and Ei
• S1, S2, S3 are error metrics: smaller value is better
• PC is a correlation coefficient: larger value is better
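A minimal sketch of the four metrics, assuming gold scores A and system estimates E, with estimates rounded to the seven legal values before computing S1 (the rounding detail is an assumption):

import numpy as np

def scoring_metrics(A, E):
    A, E = np.asarray(A, dtype=float), np.asarray(E, dtype=float)
    legal = np.arange(1.0, 4.5, 0.5)              # 1, 1.5, ..., 4
    rounded = legal[np.argmin(np.abs(E[:, None] - legal), axis=1)]
    s1 = float(np.mean(rounded != A))             # probability of error
    s2 = float(np.mean(np.abs(E - A)))            # average absolute error
    s3 = float(np.mean((E - A) ** 2))             # average squared error
    pc = float(np.corrcoef(A, E)[0, 1])           # Pearson's correlation
    return s1, s2, s3, pc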

94

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data
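A minimal sketch of this tuning loop with scikit-learn's SVR (the toolkit and the parameter grid are assumptions):

from sklearn.svm import SVR

def tune_svm_regressor(X_train, y_train, X_dev, y_dev, metric, smaller_is_better=True):
    # pick the regularization parameter C on held-out development data,
    # once per scoring metric (S1, S2, S3 prefer smaller; PC prefers larger)
    best_model, best_score = None, None
    for C in [10.0 ** k for k in range(-3, 4)]:
        model = SVR(C=C).fit(X_train, y_train)
        score = metric(y_dev, model.predict(X_dev))
        better = best_score is None or (score < best_score if smaller_is_better
                                        else score > best_score)
        if better:
            best_model, best_score = model, score
    return best_model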

95

Results

System                  S1     S2     S3     PC
Baseline (RI features) .517   .368   .234   .233
Our system             .488   .348   .197   .360

(S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson’s correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• The differences are significant

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system’s performance w.r.t. each scoring metric
– train a regressor on all but one type of features
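A minimal sketch of the ablation loop, assuming the features are stored column-wise with known index blocks per feature type (all plumbing here is hypothetical):

import numpy as np
from sklearn.svm import SVR

def ablation_results(X_train, y_train, X_test, y_test, feature_blocks, metric):
    # feature_blocks: feature-type name -> column indices of that type
    results = {}
    for name, cols in feature_blocks.items():
        keep = np.setdiff1d(np.arange(X_train.shape[1]), cols)  # all but one type
        model = SVR().fit(X_train[:, keep], y_train)
        results[name] = metric(y_test, model.predict(X_test[:, keep]))
    return results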

105

Feature Ablation Results

[Figure: the seven feature types (TC keywords, N-grams, annotated LDA topics, unlabeled LDA topics, RI, predicted TC errors, PA keywords) ranked from most important to least important under each scoring metric: S1, S2, S3, and PC]

• Relative importance of the features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features
– most important: TC keywords, n-grams, annotated LDA topics
– middling importance: RI, unlabeled LDA topics
– least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays
– feature-rich approach
• Released the annotations
– prompt adherence scores
– prompt adherence keywords
– manually annotated LDA topics

Page 44: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

44

2 Prompt Adherence Keyword Features

bull Rather than relying on keywords for thesis clarity why

not hand-pick keywords for prompt adherence and

create features from them

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 45: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

45

An Illustrative Example

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52–56

3. LDA Topics

• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt.

All armies should consist entirely of professional soldiers; there is no value in a system of military service.

• An essay containing words like "peace", "patriotism", or "training" is probably not digressing and should not be penalized for discussing these topics.
  – But the various measures of keyword similarity might not notice that anything related to the prompt is discussed.
  – This might have effects like lowering the RI similarity scores.

57–63

How to create LDA features

For each prompt (see the gensim-based sketch after this list):
1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. build an LDA model of 1,000 topics
   – a soft clustering of the words into 1,000 sets
   – e.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"
   – the model can tell us how much time an essay spends on each topic, e.g. 25% on Topic 1, 45% on Topic 2, 15% on Topic 3, ..., 5% on Topic 1000
3. construct 1,000 features, one for each topic
   – the feature value encodes how much of the essay was spent discussing the topic
   – e.g., an essay written for the military prompt might spend 55% on {"fully", "count", "ordinary", "czech", "day"} and 45% on {"man", "military", "service", "pay", "war"}
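A sketch of this pipeline using gensim, assuming `essays` holds the tokenized ICLE essays for one prompt; the original work's exact LDA toolkit and hyperparameters are not specified here, so treat this as illustrative.

```python
from gensim import corpora, models

# essays: list of token lists, one per ICLE essay written for this prompt
dictionary = corpora.Dictionary(essays)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in essays]

# Build an LDA model of 1,000 topics for this prompt
lda = models.LdaModel(bow_corpus, num_topics=1000, id2word=dictionary)

def lda_topic_features(tokens):
    """One feature per topic: the share of the essay spent on that topic."""
    bow = dictionary.doc2bow(tokens)
    feats = [0.0] * 1000
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        feats[topic_id] = prob
    return feats
```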

64–66

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression.
  – An infrequent topic may not appear often enough in the training set for the regressor to make this judgment.
• Create manually annotated LDA features.

67–70

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics.
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt (higher score = more adherent).
3. For each essay, create 10 features from the labeled topics.

71–72

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, ..., 4, respectively.
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, ..., ≥ 5, respectively.
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions (sketched below).
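A sketch of these 10 features, assuming `topic_share` is an essay's distribution over the 100 topics and `topic_label[t]` is the hand-assigned 0–5 adherence label of topic t; both names are illustrative.

```python
def annotated_lda_features(topic_share, topic_label):
    """10 features summarizing prompt-related vs. prompt-unrelated content."""
    # Features 1-5: total contribution of topics labeled exactly 0, 1, ..., 4
    exact = [sum(p for t, p in enumerate(topic_share) if topic_label[t] == k)
             for k in range(5)]
    # Features 6-10: total contribution of topics labeled >= 1, >= 2, ..., >= 5
    at_least = [sum(p for t, p in enumerate(topic_share) if topic_label[t] >= k)
                for k in range(1, 6)]
    return exact + at_least
```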

73

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we:
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74–75

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

76–78

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains.
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations.
• Predict which of the 5 error types an essay contains.
  – Recast as a multi-label classification task.

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type (a sketch of the error-type prediction follows).
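A sketch of the multi-label error-type prediction using scikit-learn; `X_train`, `X_test`, and `error_labels` are illustrative names, and the original system's learner need not be LinearSVC.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
               "Incomplete Prompt Response", "Relevance to Prompt"]

# error_labels: for each training essay, the set of error types it contains
mlb = MultiLabelBinarizer(classes=ERROR_TYPES)
Y_train = mlb.fit_transform(error_labels)  # one 0/1 column per error type

# One binary classifier per error type; its predictions become the five
# binary "error type" features of a test essay
clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
error_type_features = clf.predict(X_test)  # shape (n_essays, 5)
```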

80

6. N-gram features

• Can capture useful words and phrases related to a prompt.
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain (sketched below)
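A sketch of the n-gram selection using scikit-learn, with mutual information standing in for information gain; `lemmatized_essays` (whitespace-joined lemma strings) and `scores` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Extract lemmatized unigrams, bigrams, and trigrams
vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
X = vectorizer.fit_transform(lemmatized_essays)

# Keep the 10K n-grams most informative about the prompt adherence scores
selector = SelectKBest(mutual_info_classif, k=10000)
X_selected = selector.fit_transform(X, scores)
```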

81

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

82

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross validation

84–93

Scoring Metrics

• Define 4 evaluation metrics, comparing the annotated scores A_i against the estimated scores E_i of the N test essays:

  S1 (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4), with estimates rounded to the nearest valid score:

    S1 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[A_i \neq E_i]

  S2 (average absolute error): distinguishes near misses from far misses:

    S2 = \frac{1}{N} \sum_{i=1}^{N} |A_i - E_i|

  S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores:

    S3 = \frac{1}{N} \sum_{i=1}^{N} (A_i - E_i)^2

  PC: Pearson's correlation coefficient between the A_i and the E_i

• S1, S2, and S3 are error metrics: a smaller value is better.
• PC is a correlation coefficient: a larger value is better.
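The four metrics are straightforward to compute from the two score arrays; a sketch with numpy/scipy, assuming `A` and `E` are the annotated and estimated scores (with E already mapped to the 7 valid scores before computing S1).

```python
import numpy as np
from scipy.stats import pearsonr

def s1(A, E):  # probability of error
    return np.mean(A != E)

def s2(A, E):  # average absolute error
    return np.mean(np.abs(A - E))

def s3(A, E):  # average squared error
    return np.mean((A - E) ** 2)

def pc(A, E):  # Pearson's correlation coefficient
    return pearsonr(A, E)[0]
```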

94

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (a tuning sketch follows).
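A sketch of this tuning loop with scikit-learn's SVR as the regressor; the parameter grid, the choice of S1 as the target metric (each metric would get its own loop), and the hypothetical `round_to_valid_scores` helper are all assumptions.

```python
import numpy as np
from sklearn.svm import SVR

best_c, best_err = None, np.inf
for c in [0.01, 0.1, 1.0, 10.0, 100.0]:  # candidate regularization parameters
    model = SVR(C=c).fit(X_train, y_train)
    pred = round_to_valid_scores(model.predict(X_dev))  # hypothetical helper
    err = s1(y_dev, pred)  # S1 on the held-out development data
    if err < best_err:
        best_c, best_err = c, err
```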

95–103

Results

System       S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient. The baseline uses the RI features.)

• Our system improves over the baseline w.r.t. all four scoring metrics.
• The differences are significant.

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric.
  – Train a regressor on all but one type of features (see the sketch below).
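A sketch of the ablation loop, assuming `feature_groups` maps each of the seven feature types to its column indices in the feature matrix `X` and `evaluate` runs the 5-fold cross validation; all names are illustrative.

```python
for held_out in feature_groups:
    kept = [c for group, cols in feature_groups.items()
            if group != held_out for c in cols]
    # Performance of a regressor trained on all but one feature type
    print(held_out, evaluate(X[:, kept], y))
```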

105–118

Feature Ablation Results

Within each row, the feature types are ordered by their importance to that metric, from least important (left) to most important (right):

S1: PA keywords | annotated LDA topics | n-grams | predicted TC errors | RI | unlabeled LDA topics | TC keywords
S2: predicted TC errors | unlabeled LDA topics | PA keywords | RI | TC keywords | annotated LDA topics | n-grams
S3: TC keywords | predicted TC errors | PA keywords | unlabeled LDA topics | RI | annotated LDA topics | n-grams
PC: PA keywords | unlabeled LDA topics | predicted TC errors | RI | n-grams | TC keywords | annotated LDA topics

• The relative importance of the features does not always remain consistent if we measure performance in different ways.
• But there are features that tend to be more important than the others in the presence of the other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – of middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

(PA = prompt adherence, TC = thesis clarity.)

119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations:
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 46: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

46

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 47: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

47

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

System                   S1     S2     S3     PC
Baseline (RI features)   .517   .368   .234   .233
Our system               .488   .348   .197   .360

(S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features
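A sketch of the ablation loop, assuming each feature type occupies a known block of columns (feature_groups, train_fn, and metric are hypothetical helpers):

```python
import numpy as np

def ablation_study(feature_groups, X_tr, y_tr, X_te, y_te, train_fn, metric):
    # feature_groups: {feature-type name: list of column indices}
    results = {}
    for name, cols in feature_groups.items():
        keep = np.setdiff1d(np.arange(X_tr.shape[1]), cols)
        model = train_fn(X_tr[:, keep], y_tr)
        results[name] = metric(y_te, model.predict(X_te[:, keep]))
    return results   # the bigger the drop without a type, the more important it is
```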

Feature Ablation Results

(table ranking the seven feature types — PA keywords, TC keywords, N-grams, RI, unlabeled LDA topics, annotated LDA topics, and predicted TC errors — from most important to least important under each scoring metric: S1, S2, S3, and PC)

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But… there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach

• Released the annotations:
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 48: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

48

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 49: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

49

An Illustrative Example

bull This question suggests that students discuss whether

television is analogous to religion in this way

ndash prompt adherence keywords contain ldquoreligionrdquo

ndash thesis clarity keywords do not contain ldquoreligionrdquo

ndash A thesis like ldquoTelevision is badrdquo can be stated clearly without

reference to ldquoreligionrdquo

bull essay with this thesis could have high thesis clarity score

ndash But low adherence score Religion should be discussed

Marx once said that religion was the opium of the

masses If he was alive at the end of the 20th century he would replace religion with television

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 50: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

50

Creating Features from Keywords

bull Two types of features

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

3. LDA Topics

• Motivation: the features introduced so far have trouble identifying topics that are related to, but not explicitly mentioned in, the prompt

    All armies should consist entirely of professional soldiers;
    there is no value in a system of military service.

• An essay containing words like "peace", "patriotism", or "training" is probably not digressing and should not be penalized for discussing these topics
  – but the various measures of keyword similarity might not notice that anything related to the prompt is discussed
  – this might have effects like lowering the RI similarity scores

How to create LDA features

For each prompt (a sketch follows this list):
1. collect all the essays in the ICLE corpus written in response to it, not just those we labeled
2. build an LDA model of 1,000 topics
   – a soft clustering of the words into 1,000 sets
     • e.g., for the most frequent topic for the military prompt, the five most important words are "man", "military", "service", "pay", and "war"
   – the model can tell us how much of an essay is spent on each topic
     • e.g., 25% on Topic 1, 45% on Topic 2, 15% on Topic 3, ..., 5% on Topic 1000
3. construct 1,000 features, one for each topic
   – the feature value encodes how much of the essay was spent discussing the topic
     • e.g., an essay written for the military prompt might spend 55% on the topic {"fully", "count", "ordinary", "czech", "day"} and 45% on {"man", "military", "service", "pay", "war"}
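A minimal sketch of steps 1–3, assuming gensim as the LDA implementation (the talk does not name one) and pre-tokenized essays:

    from gensim import corpora, models

    def lda_topic_features(prompt_essays, num_topics=1000):
        """prompt_essays: tokenized ICLE essays written for one prompt.
        Returns one feature vector per essay: the share of the essay
        spent on each of the num_topics topics."""
        dictionary = corpora.Dictionary(prompt_essays)
        bows = [dictionary.doc2bow(essay) for essay in prompt_essays]
        lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
        features = []
        for bow in bows:
            # minimum_probability=0.0 returns the share of every topic
            topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
            features.append([share for _, share in topic_dist])
        return lda, features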

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish between an infrequent topic that is adherent to the prompt and one that is an irrelevant digression
  – an infrequent topic may not appear often enough in the training set for the regressor to make this judgment
• Create manually annotated LDA features

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
   • higher score = more adherent
3. For each essay, create 10 features from the labeled topics

10 Features from the Labeled Topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, ..., 4, respectively
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, ..., ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions
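A minimal sketch of the 10 features, assuming topic_share maps each of the 100 topic ids to the fraction of the essay spent on it and adherence_label holds the hand-assigned 0–5 scores:

    def annotated_lda_features(topic_share, adherence_label):
        exact = [0.0] * 5     # contributions of topics labeled exactly 0..4
        at_least = [0.0] * 5  # contributions of topics labeled >= 1 .. >= 5
        for topic, share in topic_share.items():
            label = adherence_label[topic]
            if label <= 4:
                exact[label] += share
            for k in range(1, 6):
                if label >= k:
                    at_least[k - 1] += share
        return exact + at_least  # the 10 feature values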

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• Predict which of the 5 error types an essay contains
  – recast as a multi-label classification task

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type (see the sketch below)
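A minimal sketch of the multi-label step, assuming scikit-learn (the talk does not name its learner); Y_train here is a binary indicator matrix with one column per error type:

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    ERROR_TYPES = ["Confusing Phrasing", "Missing Details", "Writer Position",
                   "Incomplete Prompt Response", "Relevance to Prompt"]

    def predicted_error_features(X_train, Y_train, X_test):
        """Returns an (n_test, 5) matrix of 0/1 predicted-error features."""
        clf = OneVsRestClassifier(LinearSVC())  # one binary classifier per type
        clf.fit(X_train, Y_train)
        return clf.predict(X_test)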

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain (a selection sketch follows)
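A minimal sketch of the selection step, using scikit-learn's mutual information estimate as the information-gain score (an assumption) over essays that are lemmatized beforehand:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import mutual_info_classif

    def top_ngram_features(lemmatized_essays, scores, k=10000):
        vec = CountVectorizer(ngram_range=(1, 3), binary=True)
        X = vec.fit_transform(lemmatized_essays)      # uni-, bi-, and trigrams
        ig = mutual_info_classif(X, scores, discrete_features=True)
        top = np.argsort(ig)[::-1][:k]                # k highest-IG n-grams
        return X[:, top], np.array(vec.get_feature_names_out())[top]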

Summary of Features

• Baseline features
  – random indexing features
• Six types of features
  – lemmatized n-grams
  – thesis clarity keyword features
  – prompt adherence keyword features
  – LDA topics
  – manually annotated LDA topics
  – predicted thesis clarity errors

Plan for the Talk

• Corpus and annotations
• Model for scoring prompt adherence
• Evaluation

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei (reconstructed formulas below):
  – S1 (probability of error): the probability that a system predicts the wrong score, out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from far misses
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3 are error metrics: a smaller value is better
• PC is a correlation coefficient: a larger value is better
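The formula images did not survive extraction; a plausible reconstruction from the slide descriptions, with A_j the annotated and E_j the estimated score of essay j out of N essays (for S_1, E_j is assumed to be rounded to the nearest of the 7 valid scores):

    % reconstructed from the slide descriptions, not copied from the paper
    S_1 = \frac{1}{N} \sum_{j=1}^{N} \mathbf{1}\!\left[A_j \neq E_j\right], \quad
    S_2 = \frac{1}{N} \sum_{j=1}^{N} \left|A_j - E_j\right|, \quad
    S_3 = \frac{1}{N} \sum_{j=1}^{N} \left(A_j - E_j\right)^2

    PC = \frac{\sum_{j}(A_j - \bar{A})(E_j - \bar{E})}
              {\sqrt{\sum_{j}(A_j - \bar{A})^2}\,\sqrt{\sum_{j}(E_j - \bar{E})^2}}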

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (a tuning sketch follows)
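A minimal sketch of the tuning loop, assuming scikit-learn's SVR and an illustrative C grid (neither is specified in the talk):

    from sklearn.svm import SVR

    def tune_regressor(X_train, y_train, X_dev, y_dev, metric, lower_is_better):
        """Pick the regularization parameter C that optimizes `metric`
        (e.g., S1, S2, S3, or PC) on held-out development data."""
        best_model, best_score = None, None
        for c in [10.0 ** k for k in range(-3, 4)]:
            model = SVR(kernel="linear", C=c).fit(X_train, y_train)
            score = metric(y_dev, model.predict(X_dev))
            if best_score is None or (score < best_score
                                      if lower_is_better else score > best_score):
                best_model, best_score = model, score
        return best_model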

Results

    System        S1     S2     S3     PC
    Baseline     .517   .368   .234   .233
    Our system   .488   .348   .197   .360

• Baseline: the random indexing (RI) features
• S1 = probability of error; S2 = average absolute error; S3 = average squared error; PC = Pearson's correlation coefficient
• Improvements w.r.t. all four scoring metrics
• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric (see the sketch below)
  – train a regressor on all but one type of features
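A minimal sketch of the ablation protocol: retrain with one feature type held out and score under each metric; feature types are stored as column blocks, and all names here are illustrative:

    import numpy as np

    def ablate(train_groups, y_train, dev_groups, y_dev, train_fn, metrics):
        """train_groups / dev_groups: dict feature-type -> matrix block.
        train_fn(X, y) -> fitted regressor; metrics: name -> metric fn."""
        results = {}
        for held_out in train_groups:
            keep = [g for g in sorted(train_groups) if g != held_out]
            X_tr = np.hstack([train_groups[g] for g in keep])
            X_de = np.hstack([dev_groups[g] for g in keep])
            model = train_fn(X_tr, y_train)
            preds = model.predict(X_de)
            results[held_out] = {m: f(y_dev, preds) for m, f in metrics.items()}
        return results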

Feature Ablation Results

[Ablation charts: for each scoring metric (S1, S2, S3, PC), the seven feature types — PA keywords, TC keywords, n-grams, RI, unlabeled LDA topics, annotated LDA topics, predicted TC errors — ranked by importance (TC = thesis clarity, PA = prompt adherence)]

• The relative importance of the feature types does not always remain consistent if we measure performance in different ways
• But there are feature types that tend to be more important than the others in the presence of the other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations:
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 51: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

51

Creating Features from Keywords

bull Two types of features

bull For each prompt component

1 take the RI similarity between the whole essay and the

componentrsquos keywords

2 compute the fraction of the componentrsquos keywords

that appear in the essay

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 52: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

52

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

53

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

All armies should consist entirely of professional soldiers

there is no value in a system of military service

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105-118

Feature Ablation Results

Features ordered from least important (left) to most important (right) for each scoring metric:

S1: PA keywords, Annotated LDA topics, N-grams, Predicted TC errors, RI, Unlabeled LDA topics, TC keywords
S2: Predicted TC errors, Unlabeled LDA topics, PA keywords, RI, TC keywords, Annotated LDA topics, N-grams
S3: TC keywords, Predicted TC errors, PA keywords, Unlabeled LDA topics, RI, Annotated LDA topics, N-grams
PC: PA keywords, Unlabeled LDA topics, Predicted TC errors, RI, N-grams, TC keywords, Annotated LDA topics

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But… there are features that tend to be more important than the others in the presence of other features:

  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords
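As an illustration, one way such per-metric orderings could be derived from the ablation() results sketched above; full_scores, a dict holding the all-features system's score for each metric, is an assumed input.

# Hypothetical: derive a least-to-most-important ordering for one metric.
def order_by_importance(results, full_scores, metric):
    # Removing an important feature type raises the error metrics S1-S3
    # but lowers the correlation PC, so flip the sign for PC.
    sign = -1.0 if metric == "PC" else 1.0
    def damage(feat):  # how much performance suffered when feat was removed
        return sign * (results[(feat, metric)] - full_scores[metric])
    return sorted(FEATURE_TYPES, key=damage)  # least -> most important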

119

Summary

• Examined the problem of prompt adherence scoring in student essays

  – feature-rich approach

• Released the annotations

  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics


Page 54: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

54

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

All armies should consist entirely of professional soldiers

there is no value in a system of military service

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 55: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

55

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

All armies should consist entirely of professional soldiers

there is no value in a system of military service

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (a sketch of this tuning loop follows)
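A minimal sketch of the tuning loop, assuming scikit-learn's SVR as the regressor (the talk does not name the SVM package) and S2 as the metric being optimized; the grid of C values is illustrative:

    import numpy as np
    from sklearn.svm import SVR

    def average_absolute_error(a, e):  # the S2 metric; smaller is better
        return float(np.mean(np.abs(np.asarray(a) - np.asarray(e))))

    def tune_c(X_train, y_train, X_dev, y_dev,
               metric=average_absolute_error, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
        best_c, best_err = None, float("inf")
        for c in grid:
            pred = SVR(C=c).fit(X_train, y_train).predict(X_dev)
            err = metric(y_dev, pred)
            if err < best_err:  # for PC one would instead keep the largest value
                best_c, best_err = c, err
        return best_c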

Results

System                   S1     S2     S3     PC
Baseline (RI features)  .517   .368   .234   .233
Our system              .488   .348   .197   .360

(S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• Significant differences

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features (see the sketch below)
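A sketch of that ablation loop, assuming each feature type is kept in its own matrix (a dict from feature-type name to matrix); the regressor and metric choices follow the tuning sketch above:

    import numpy as np
    from sklearn.svm import SVR

    def ablate(train_groups, y_train, dev_groups, y_dev, metric):
        """train_groups/dev_groups: dict of feature-type name -> feature matrix."""
        results = {}
        for held_out in train_groups:
            keep = [g for g in sorted(train_groups) if g != held_out]
            X_tr = np.hstack([train_groups[g] for g in keep])
            X_dv = np.hstack([dev_groups[g] for g in keep])
            pred = SVR().fit(X_tr, y_train).predict(X_dv)
            results[held_out] = metric(y_dev, pred)  # larger error on removal
        return results                               # => more important feature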

Feature Ablation Results

[Table: importance ranking of the seven feature types (PA keywords, TC keywords, n-grams, RI, unlabeled LDA topics, annotated LDA topics, predicted TC errors) under each of the scoring metrics S1, S2, S3, and PC]

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – of middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations:
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 56: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

56

3 LDA Topics

bull Motivation the features introduced so far have

trouble identifying topics that are related to but not

explicitly mentioned in the prompt

bull An essay containing words like ldquopeacerdquo ldquopatriotismrdquo or

ldquotrainingrdquo are probably not digressions and should not be

penalized for discussing these topics

ndash But the various measures of keyword similarities might not

notice that anything related to the prompt is discussed

ndash this might have effects like lowering the RI similarity scores

All armies should consist entirely of professional soldiers

there is no value in a system of military service

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 57: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

57

How to create LDA features

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 58: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

58

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei:
– S1 (probability of error): probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
– S2 (average absolute error): distinguishes near misses from far misses
– S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
– PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better

94
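The slide's formulas did not survive extraction; below is a sketch of the four metrics under the standard reading of the labels above. Whether the error metrics compare against the raw estimate or one rounded to the 7 half-point scores is an assumption here.

import numpy as np

SCORES = np.arange(1.0, 4.5, 0.5)  # the 7 possible scores: 1, 1.5, ..., 4

def scoring_metrics(A, E):
    # A: annotated scores, E: estimated scores
    A, E = np.asarray(A, float), np.asarray(E, float)
    # round each estimate to the nearest of the 7 possible scores
    Er = SCORES[np.abs(E[:, None] - SCORES[None, :]).argmin(axis=1)]
    s1 = np.mean(Er != A)           # probability of error
    s2 = np.mean(np.abs(E - A))     # average absolute error
    s3 = np.mean((E - A) ** 2)      # average squared error
    pc = np.corrcoef(A, E)[0, 1]    # Pearson's correlation coefficient
    return s1, s2, s3, pc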

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data

95
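A sketch of what this tuning loop might look like, assuming a linear SVR and a small grid of C values; both are illustrative assumptions rather than the talk's configuration.

from sklearn.svm import LinearSVR

def tune_regressor(X_train, y_train, X_dev, y_dev, metric, smaller_is_better):
    best = None  # (dev score, C, model)
    for c in [0.01, 0.1, 1.0, 10.0, 100.0]:
        model = LinearSVR(C=c).fit(X_train, y_train)
        score = metric(y_dev, model.predict(X_dev))
        # keep the model whose dev score is best for this metric's direction
        if best is None or (score < best[0]) == smaller_is_better:
            best = (score, c, model)
    return best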

Results

System       S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

(Baseline: random indexing (RI) features; S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient)

• Improvements w.r.t. all four scoring metrics
• Significant differences

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
– Train a regressor on all but one type of features (see the sketch below)

105
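A sketch of the ablation loop; the feature-type list comes from the summary slide above, while the two helper callables are placeholders for the training pipeline.

FEATURE_TYPES = ["RI", "N-grams", "TC keywords", "PA keywords",
                 "Unlabeled LDA topics", "Annotated LDA topics",
                 "Predicted TC errors"]

def ablate(build_matrices, train_and_score, metric):
    # build_matrices(kept_types) -> (X_train, y_train, X_test, y_test)
    # train_and_score(...)       -> the chosen metric on the test folds
    results = {}
    for held_out in FEATURE_TYPES:
        kept = [f for f in FEATURE_TYPES if f != held_out]
        results[held_out] = train_and_score(*build_matrices(kept), metric)
    # the larger the degradation without a feature type, the more important it is
    return results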

Feature Ablation Results

[Table: for each scoring metric (S1, S2, S3, PC), the seven feature types — TC keywords, PA keywords, n-grams, RI, unlabeled LDA topics, annotated LDA topics, predicted TC errors — ranked from most important to least important; the rankings differ across the four metrics]

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
– most important: TC keywords, n-grams, annotated LDA topics
– middling important: RI, unlabeled LDA topics
– least important: predicted TC errors, PA keywords

118

119

Summary

• Examined the problem of prompt adherence scoring in student essays
– feature-rich approach
• Released the annotations
– prompt adherence scores
– prompt adherence keywords
– manually annotated LDA topics

Page 59: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

59

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 60: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

60

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Soft clustering of the words into 1000 sets

bull Eg for the most frequent topic for the military prompt

the five most important words are

ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo and ldquowarrdquo

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 61: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

61

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

ndash Model can tell us how much an essay spends on each topic

bull Eg

5Topic1000

helliphellip

15Topic3

45Topic2

25Topic1

62

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

63

How to create LDA features

For each prompt

1 collect all the essays in the ICLE corpus written in response

to it not just those we labeled

2 build an LDA of 1000 topics

3 construct 1000 features one for each topic

bull Feature value encodes how much of the essay was spent

discussing the topic

Eg if an essay written for the military prompt spends

55ldquofullyrdquo ldquocountrdquo ldquoordinaryrdquo ldquoczechrdquo ldquodayrdquo

45ldquomanrdquo ldquomilitaryrdquo ldquoservicerdquo ldquopayrdquo ldquowarrdquo

64

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

65

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores A_i and the estimated scores E_i of the N test essays (computed in the sketch below):

S1 = (1/N) Σ_i 1[A_i ≠ E_i]  (probability of error)
– the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)

S2 = (1/N) Σ_i |A_i − E_i|  (average absolute error)
– distinguishes near misses from far misses

S3 = (1/N) Σ_i (A_i − E_i)²  (average squared error)
– prefers systems whose estimations are not too often far away from the correct scores

PC = Pearson's correlation coefficient between A_i and E_i

• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better
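A sketch of the four metrics under one stated assumption: the regressor's continuous output is snapped to the nearest of the 7 possible scores before the metrics are computed (the slides do not specify whether S2 and S3 use the raw or the snapped estimate).

```python
import numpy as np
from scipy.stats import pearsonr

POSSIBLE_SCORES = np.arange(1.0, 4.5, 0.5)  # 1, 1.5, ..., 4

def score_metrics(annotated, estimated):
    """Compute S1, S2, S3 and PC for annotated scores A and estimates E."""
    A = np.asarray(annotated, dtype=float)
    # Snap each continuous regressor output to the nearest of the
    # 7 possible scores (assumption; needed for the exact-match S1).
    E = POSSIBLE_SCORES[np.argmin(
        np.abs(np.asarray(estimated)[:, None] - POSSIBLE_SCORES), axis=1)]
    s1 = np.mean(A != E)          # probability of error
    s2 = np.mean(np.abs(A - E))   # average absolute error
    s3 = np.mean((A - E) ** 2)    # average squared error
    pc = pearsonr(A, E)[0]        # Pearson's correlation coefficient
    return s1, s2, s3, pc
```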

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (see the sketch below)
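A sketch of metric-specific tuning, assuming a small grid for the regularization parameter C and an error metric (smaller is better); for PC the comparison would flip to maximization. The kernel choice and grid values are assumptions.

```python
from sklearn.svm import SVR

def tune_svr(X_train, y_train, X_dev, y_dev, metric_fn,
             c_grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Pick the regularization parameter C that minimizes one scoring
    metric on held-out development data, then return that regressor."""
    best_c, best_val = None, float("inf")
    for c in c_grid:
        model = SVR(kernel="linear", C=c).fit(X_train, y_train)
        val = metric_fn(y_dev, model.predict(X_dev))
        if val < best_val:
            best_c, best_val = c, val
    return SVR(kernel="linear", C=best_c).fit(X_train, y_train)
```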

Results

System       S1    S2    S3    PC
Baseline     .517  .368  .234  .233
Our system   .488  .348  .197  .360

• Baseline: random indexing (RI) features
• S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient
• Our system improves on the baseline w.r.t. all four scoring metrics, and the differences are significant

Feature Ablation

• Goal: examine how much impact each feature type has on our system's performance w.r.t. each scoring metric
– Train a regressor on all but one type of features (see the sketch below)
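A sketch of the leave-one-type-out loop; `feature_blocks` and `evaluate` are hypothetical names, with `evaluate` standing in for the full 5-fold evaluation above (it should return the four scoring metrics).

```python
import numpy as np

def ablation(feature_blocks, gold, evaluate):
    """Leave-one-feature-type-out ablation.

    feature_blocks: dict mapping a feature type name (e.g. "TC keywords",
    "N-grams") to its dense (n_essays x d_i) feature matrix.
    """
    results = {}
    for held_out in feature_blocks:
        # Train/evaluate on every feature type except the held-out one.
        X = np.hstack([m for name, m in feature_blocks.items()
                       if name != held_out])
        results[held_out] = evaluate(X, gold)  # S1, S2, S3, PC without it
    return results
```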

Feature Ablation Results

• Feature types ranked per scoring metric, from least important (left) to most important (right):
– S1: PA keywords, Annotated LDA topics, N-grams, Predicted TC errors, RI, Unlabeled LDA topics, TC keywords
– S2: Predicted TC errors, Unlabeled LDA topics, PA keywords, RI, TC keywords, Annotated LDA topics, N-grams
– S3: TC keywords, Predicted TC errors, PA keywords, Unlabeled LDA topics, RI, Annotated LDA topics, N-grams
– PC: PA keywords, Unlabeled LDA topics, Predicted TC errors, RI, N-grams, TC keywords, Annotated LDA topics
• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
– most important: TC keywords, n-grams, annotated LDA topics
– middling important: RI, unlabeled LDA topics
– least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
– feature-rich approach
• Released the annotations:
– prompt adherence scores
– prompt adherence keywords
– manually annotated LDA topics


S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 65: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

65–66

4. Manually Annotated LDA Topics

• Motivation: the regressor using LDA features may not be able to distinguish an infrequent topic that is adherent to the prompt from one that is an irrelevant digression
  – An infrequent topic may not appear often enough in the training set for the regressor to make this judgment
• Create manually annotated LDA features

67–70

How to Create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA model of 100 topics
2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt
   • higher score, more adherent
3. For each essay, create 10 features from the labeled topics

71–72

10 Features from the Labeled Topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, respectively (computed in the sketch below)
• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, respectively
• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions
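To make the construction concrete, here is a minimal sketch (not the authors' code) of the 10 features. It assumes we already have, for one essay, its topic distribution under the 100-topic LDA model (topic_weights) and the hand-assigned 0–5 adherence label for each topic (topic_labels); both names are illustrative.

# Minimal sketch: 10 features from manually annotated LDA topics.
# topic_weights[i] = contribution of topic i to this essay;
# topic_labels[i]  = hand-annotated adherence score (0..5) of topic i.
from typing import List

def annotated_lda_features(topic_weights: List[float],
                           topic_labels: List[int]) -> List[float]:
    """Return 10 features: mass of topics labeled exactly 0..4,
    then mass of topics labeled >= 1 .. >= 5."""
    assert len(topic_weights) == len(topic_labels)
    exact = [sum(w for w, l in zip(topic_weights, topic_labels) if l == k)
             for k in range(5)]                       # labels 0,1,2,3,4
    at_least = [sum(w for w, l in zip(topic_weights, topic_labels) if l >= k)
                for k in range(1, 6)]                 # labels >=1 .. >=5
    return exact + at_least

# toy usage: 4 topics instead of 100
print(annotated_lda_features([0.5, 0.2, 0.2, 0.1], [0, 3, 5, 5]))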

73

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we
  – score an essay w.r.t. the clarity of its thesis
  – determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74–75

5 Types of Thesis Clarity Errors

• Confusing Phrasing
• Missing Details
• Writer Position
• Incomplete Prompt Response
• Relevance to Prompt

76–78

Features Based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains
  – Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations
• Predict which of the 5 error types an essay contains
  – Recast as a multi-label classification task (see the sketch after slide 79)

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type
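A minimal sketch of this step, under assumed setup (not the authors' pipeline): one binary classifier per error type, whose predictions become the five binary features. X is any essay feature matrix; Y is an n_essays × 5 binary indicator matrix over the five error types; the classifier choice (a linear SVM) is illustrative.

# Minimal sketch: predicted thesis-clarity errors as multi-label classification.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ERROR_TYPES = ["confusing_phrasing", "missing_details", "writer_position",
               "incomplete_prompt_response", "relevance_to_prompt"]

def predicted_error_features(X_train, Y_train, X_test):
    """Train one binary classifier per error type; return a 0/1 feature
    matrix with one column per error type for the test essays."""
    clf = OneVsRestClassifier(LinearSVC())   # one binary SVM per error type
    clf.fit(X_train, Y_train)                # Y_train: multi-label 0/1 matrix
    return clf.predict(X_test)               # shape: (n_test, 5)

# toy usage with random data
rng = np.random.RandomState(0)
X_tr, X_te = rng.rand(40, 8), rng.rand(5, 8)
Y_tr = rng.randint(0, 2, size=(40, len(ERROR_TYPES)))
print(predicted_error_features(X_tr, Y_tr, X_te))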

80

6. N-gram Features

• Can capture useful words and phrases related to a prompt
• 10K lemmatized unigrams, bigrams, and trigrams
  – selected according to information gain (see the sketch below)
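A minimal sketch of this selection step, with assumptions: essays are taken to be pre-lemmatized strings, information gain is approximated by scikit-learn's mutual information estimator, and the 7 legal scores are mapped to integer class labels for the selector. The function and variable names are illustrative.

# Minimal sketch: keep the k uni/bi/trigrams with the highest information gain.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ngram_features(lemmatized_essays, score_labels, k=10000):
    vec = CountVectorizer(ngram_range=(1, 3), binary=True)
    X = vec.fit_transform(lemmatized_essays)
    k = min(k, X.shape[1])                    # a toy corpus has fewer than 10K
    sel = SelectKBest(mutual_info_classif, k=k).fit(X, score_labels)
    return sel.transform(X), vec, sel

# toy usage: scores 2.0, 4.0, 3.5, 4.0 mapped to integer labels (2x the score)
essays = ["the club be bad for student", "smoking should be ban in public",
          "student benefit from part time work", "ban smoking everywhere now"]
labels = [4, 8, 7, 8]
X, vec, sel = ngram_features(essays, labels, k=20)
print(X.shape)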

81

Summary of Features

• Baseline features
  – Random indexing features
• Six types of features
  – Lemmatized n-grams
  – Thesis clarity keyword features
  – Prompt adherence keyword features
  – LDA topics
  – Manually annotated LDA topics
  – Predicted thesis clarity errors

82

Plan for the Talk

• Corpus and Annotations
• Model for scoring prompt adherence
• Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring
• 5-fold cross-validation

84–93

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei (see the sketch below):
  – S1 (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)
  – S2 (average absolute error): distinguishes near misses from far misses
  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores
  – PC: Pearson's correlation coefficient between Ai and Ei
• S1, S2, S3: error metrics; smaller value is better
• PC: correlation coefficient; larger value is better
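Written out from the definitions above, a minimal sketch of the four metrics (not the authors' code). One assumption: for S1 the estimate is first rounded to the nearest of the 7 legal scores before comparing against the annotated score, while S2 and S3 use the raw estimate.

# Minimal sketch: the four scoring metrics.
import numpy as np

LEGAL = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])

def round_to_legal(E):
    """Snap each estimate to the nearest of the 7 legal scores."""
    return LEGAL[np.argmin(np.abs(np.asarray(E)[:, None] - LEGAL), axis=1)]

def scoring_metrics(A, E):
    A, E = np.asarray(A, float), np.asarray(E, float)
    Er = round_to_legal(E)
    s1 = np.mean(Er != A)            # probability of predicting the wrong score
    s2 = np.mean(np.abs(E - A))      # average absolute error
    s3 = np.mean((E - A) ** 2)       # average squared error
    pc = np.corrcoef(A, E)[0, 1]     # Pearson's correlation coefficient
    return s1, s2, s3, pc

print(scoring_metrics([3.0, 2.5, 4.0], [2.9, 2.5, 3.4]))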

94

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data (a tuning sketch follows)
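A minimal sketch of that tuning loop, under assumed setup (a grid over C, one SVR per value, selection on the dev split by the target metric); the grid and helper names are illustrative, not the authors' exact protocol.

# Minimal sketch: tune the SVM regressor's C on held-out development data.
import numpy as np
from sklearn.svm import SVR

def tune_svr(X_tr, y_tr, X_dev, y_dev, metric, grid=(0.01, 0.1, 1, 10, 100)):
    """Fit one SVR per C value; keep the one the metric likes best.
    `metric` maps (gold, predicted) -> score; lower is taken to be better,
    so pass an error metric such as S2, or negate PC before passing it."""
    best = None
    for C in grid:
        model = SVR(C=C).fit(X_tr, y_tr)
        score = metric(y_dev, model.predict(X_dev))
        if best is None or score < best[0]:
            best = (score, model)
    return best[1]

# toy usage with average absolute error (the S2 metric above)
rng = np.random.RandomState(1)
X = rng.rand(60, 5); y = 1 + 3 * X[:, 0]
s2 = lambda a, e: float(np.mean(np.abs(np.asarray(a) - np.asarray(e))))
model = tune_svr(X[:40], y[:40], X[40:], y[40:], s2)
print(round(s2(y[40:], model.predict(X[40:])), 3))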

95–103

Results

System       S1     S2     S3     PC
Baseline     .517   .368   .234   .233
Our system   .488   .348   .197   .360

• Baseline: RI features
• S1: probability of error; S2: average absolute error; S3: average squared error; PC: Pearson's correlation coefficient
• Improvements w.r.t. all four scoring metrics
• Significant differences

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric
  – Train a regressor on all but one type of features (as in the sketch below)
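A minimal sketch of the ablation loop, under assumed setup: `feature_blocks` (an illustrative name) maps each feature-type name to its column indices in the full feature matrix, and performance is measured on the training data here only to keep the example short; in practice it would be cross-validated under each scoring metric.

# Minimal sketch: retrain with each feature type removed and measure the change.
import numpy as np
from sklearn.svm import SVR

def ablation(X, y, feature_blocks, metric):
    full = SVR().fit(X, y)
    base = metric(y, full.predict(X))
    results = {}
    for name, cols in feature_blocks.items():
        keep = [c for c in range(X.shape[1]) if c not in set(cols)]
        model = SVR().fit(X[:, keep], y)
        results[name] = metric(y, model.predict(X[:, keep])) - base
    return results   # larger increase in error => more important feature type

rng = np.random.RandomState(2)
X = rng.rand(50, 6); y = X[:, 0] + 0.1 * rng.rand(50)
s2 = lambda a, e: float(np.mean(np.abs(a - e)))
print(ablation(X, y, {"block_a": [0, 1], "block_b": [2, 3]}, s2))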

105–118

Feature Ablation Results

[Ablation grid: for each scoring metric (S1, S2, S3, PC), the seven feature types (TC keywords, PA keywords, N-grams, annotated LDA topics, unlabeled LDA topics, RI, predicted TC errors) are ranked from most important to least important; TC = thesis clarity, PA = prompt adherence, RI = random indexing]

• The relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 66: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

66

4 Manually Annotated LDA Topics

bull Motivation the regressor using LDA features may not be

able to distinguish an infrequent topic that is adherent to

the prompt and one that is an irrelevant digression

ndash An infrequent topic may not appear enough in the training

set for the regressor to make this judgment

bull Create manually annotated LDA features

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 67: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

67

How to create Manually Annotated

LDA Features

68

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 68: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

68–70

How to create Manually Annotated LDA Features

1. For each set of essays written for a given prompt, build an LDA of 100 topics

2. For each topic, inspect its top 10 words and hand-annotate it with a number from 0 to 5 representing how likely it is that the topic is adherent to the prompt

• higher score = more adherent

3. For each essay, create 10 features from the labeled topics
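A minimal sketch of step 1 (plus the top-10-word inspection that feeds the hand-annotation in step 2), assuming gensim as the LDA implementation; the function name and the pre-tokenized input are hypothetical:

```python
from gensim import corpora, models

def build_prompt_lda(tokenized_essays, num_topics=100):
    """Fit a 100-topic LDA model on the essays written for one prompt.

    tokenized_essays -- list of token lists, one per essay
    """
    dictionary = corpora.Dictionary(tokenized_essays)
    bow = [dictionary.doc2bow(tokens) for tokens in tokenized_essays]
    lda = models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)
    # Step 2 hand-annotates each topic after inspecting its top 10 words:
    top_words = [lda.show_topic(t, topn=10) for t in range(num_topics)]
    return lda, dictionary, top_words
```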

71–72

10 Features from the labeled topics

• Five features encode the sum of contributions to an essay of topics annotated with the number 0, 1, …, 4, resp.

• Five features encode the sum of contributions to an essay of topics annotated with a number ≥ 1, ≥ 2, …, ≥ 5, resp.

• These features should give the regressor a better idea of how much of an essay is composed of prompt-related vs. prompt-unrelated discussions
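A minimal sketch of how these 10 features could be computed for one essay, assuming we already have the essay's topic distribution under the prompt's 100-topic LDA model and the hand-assigned 0–5 adherence labels (the function name is hypothetical):

```python
import numpy as np

def annotated_lda_features(topic_probs, topic_labels):
    """10 features from the hand-labeled LDA topics for one essay.

    topic_probs  -- length-100 array: topic t's contribution to the essay
    topic_labels -- length-100 array: hand-assigned labels in {0, ..., 5}
    """
    probs = np.asarray(topic_probs)
    labels = np.asarray(topic_labels)
    # Five features: total contribution of topics labeled exactly 0, 1, ..., 4.
    exact = [probs[labels == k].sum() for k in range(5)]
    # Five features: total contribution of topics labeled >= 1, >= 2, ..., >= 5.
    at_least = [probs[labels >= k].sum() for k in range(1, 6)]
    return exact + at_least
```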

73

5. Predicted Thesis Clarity Errors

• In previous work on thesis clarity essay scoring (Persing & Ng, 2013), we

– score an essay wrt the clarity of its thesis

– determine which type(s) of errors an essay contains that detract from the clarity of its thesis

74–75

5 Types of Thesis Clarity Errors

• Confusing Phrasing

• Missing Details

• Writer Position

• Incomplete Prompt Response

• Relevance to Prompt

76–78

Features based on Error Types

• Introduced features for prompt adherence scoring that encode the error types an essay contains

– Though each essay was manually annotated with the errors it contains, in a realistic setting we won't have access to these manual annotations

• Predict which of the 5 error types an essay contains

– Recast as a multi-label classification task

79

Creating Predicted "Error Type" Features

• Add a binary feature indicating the presence or absence of each error type
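One plausible realization of this multi-label prediction step, assuming an independent binary one-vs-rest classifier per error type (scikit-learn is shown for concreteness; the data arrays are stand-ins):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Stand-in data: essay feature vectors and 0/1 annotations for the
# 5 thesis clarity error types (one column per error type).
X_train = np.random.rand(100, 20)
Y_train = np.random.randint(0, 2, size=(100, 5))
X_new = np.random.rand(10, 20)

# Multi-label classification: one binary classifier per error type.
clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)

# Each column of the prediction matrix becomes one binary
# "predicted error type" feature for prompt adherence scoring.
error_type_features = clf.predict(X_new)   # shape (10, 5), entries 0/1
```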

80

6. N-gram features

• Can capture useful words and phrases related to a prompt

• 10K lemmatized unigrams, bigrams, and trigrams

– selected according to information gain
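A sketch of the selection step, using mutual information with the discrete score label as the information-gain criterion; the corpus variables are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

# Placeholder corpus: lemmatized essay texts and their adherence scores,
# treated as discrete class labels (7 possible values).
essays = ["lemmatized essay text ...", "another lemmatized essay ..."]
labels = ["3.0", "2.5"]

# Count unigrams, bigrams, and trigrams over the lemmatized text.
vec = CountVectorizer(ngram_range=(1, 3))
X = vec.fit_transform(essays)

# Rank n-grams by information gain and keep the top 10K.
gain = mutual_info_classif(X, labels, discrete_features=True)
top = np.argsort(gain)[::-1][:10000]
selected_ngrams = np.array(vec.get_feature_names_out())[top]
```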

81

Summary of Features

• Baseline features

– Random indexing features

• Six types of features

– Lemmatized n-grams

– Thesis clarity keyword features

– Prompt adherence keyword features

– LDA topics

– Manually annotated LDA topics

– Predicted thesis clarity errors

82

Plan for the Talk

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

83

Evaluation

• Goal: evaluate our system for prompt adherence scoring

• 5-fold cross validation

84–93

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei:

– S1 (probability of error): probability that a system predicts the wrong score out of 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)

– S2 (average absolute error): distinguishes near misses from far misses

– S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores

– PC: Pearson's correlation coefficient between Ai and Ei

• S1, S2, S3: error metrics, so a smaller value is better

• PC: a correlation coefficient, so a larger value is better
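The slides present these metrics as formulas. Written out, with Ai the annotated score and Ei the estimated score of essay i, Ei' the estimate rounded to the nearest of the 7 possible scores, and N essays, a formulation consistent with the descriptions above is:

```latex
S_1 = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[A_i \neq E_i'\right], \qquad
S_2 = \frac{1}{N}\sum_{i=1}^{N} \left|A_i - E_i\right|, \qquad
S_3 = \frac{1}{N}\sum_{i=1}^{N} \left(A_i - E_i\right)^2
```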

94

Regressor Training

• SVM regressors are trained to maximize performance wrt each scoring metric by tuning the regularization parameter on held-out development data
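A minimal sketch of this tuning loop, assuming scikit-learn's SVR as the regressor and a small grid over the regularization parameter C; metric_fn stands for one of S1, S2, S3, or PC:

```python
from sklearn.svm import SVR

def tune_regressor(X_tr, y_tr, X_dev, y_dev, metric_fn, smaller_is_better=True):
    """Pick the C that optimizes one scoring metric on held-out dev data."""
    best = None
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
        model = SVR(kernel="linear", C=C).fit(X_tr, y_tr)
        score = metric_fn(y_dev, model.predict(X_dev))
        # Update when the dev score improves in the metric's direction.
        if best is None or (score < best[0]) == smaller_is_better:
            best = (score, C, model)
    return best  # (dev score, chosen C, trained model)
```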

95–103

Results

System       S1     S2     S3     PC
Baseline    .517   .368   .234   .233
Our system  .488   .348   .197   .360

(Baseline = RI features; S1 = probability of error, S2 = average absolute error, S3 = average squared error, PC = Pearson's correlation coefficient)

• Improvements wrt all four scoring metrics

• Significant differences

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance wrt each scoring metric

– Train a regressor on all but one type of features
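A sketch of the ablation loop under the same assumptions as the training snippet above (feature matrices grouped by type in dicts; the names and helpers are hypothetical):

```python
import numpy as np
from sklearn.svm import SVR

def ablation(train_blocks, y_train, dev_blocks, y_dev, metric_fn):
    """Leave-one-feature-type-out ablation.

    train_blocks / dev_blocks -- dicts mapping a feature-type name
    (e.g. 'RI', 'N-grams', 'PA keywords') to its feature matrix.
    """
    results = {}
    for held_out in train_blocks:
        X_tr = np.hstack([m for n, m in train_blocks.items() if n != held_out])
        X_dev = np.hstack([m for n, m in dev_blocks.items() if n != held_out])
        model = SVR(kernel="linear").fit(X_tr, y_train)
        # The bigger the performance drop without this feature type,
        # the more important the feature type is.
        results[held_out] = metric_fn(y_dev, model.predict(X_dev))
    return results
```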

105–109

Feature Ablation Results

[Ablation chart: for each scoring metric (S1, S2, S3, PC), the seven feature types (PA keywords, TC keywords, N-grams, annotated LDA topics, unlabeled LDA topics, predicted TC errors, RI) ordered from most to least important]

110–111

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But… there are features that tend to be more important than the others in the presence of other features

112–118

• most important: TC keywords, n-grams, annotated LDA topics

• of middling importance: RI, unlabeled LDA topics

• least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays

– feature-rich approach

• Released the annotations

– prompt adherence scores

– prompt adherence keywords

– manually annotated LDA topics

Page 69: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

69

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 70: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

70

How to create Manually Annotated

LDA Features

1 For each set of essays written for a given prompt build an

LDA of 100 topics

2 For each topic inspect its top 10 words and hand-annotate

it with a number from 0 to 5 representing how likely it is

that the topic is adherent to the prompt

bull higher score more adherent

3 For each essay create 10 features from the labeled topics

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 71: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

71

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

[Bar charts, one per scoring metric (S1, S2, S3, PC), ranking the
feature types – PA keywords, TC keywords, N-grams, Annotated LDA
topics, Unlabeled LDA topics, RI, Predicted TC errors – from most
important to least important]

• Relative importance of features does not always remain consistent
  if we measure performance in different ways
• But… there are features that tend to be more important than the
  others in the presence of other features
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 72: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

72

10 Features from the labeled topics

bull Five features encode the sum of contributions to an essay

of topics annotated with the number 0 1 hellip 4 resp

bull Five features encode the sum of contributions to an essay

of topics annotated with a number ge 1 ge 2 hellip ge 5 resp

bull These features should give the regressor a better idea of

how much of an essay is composed of prompt-related vs

prompt-unrelated discussions

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 73: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

73

5 Predicted Thesis Clarity Errors

bull In previous work on thesis clarity essay scoring

(Persing amp Ng 2013) we

ndash score an essay wrt the clarity of its thesis

ndash determine which type(s) of errors an essay contains that

detract from the clarity of its thesis

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 74: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

74

5 Types of Thesis Clarity Errors

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 75: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

75

bull Confusing Phrasing

bull Missing Details

bull Writer Position

bull Incomplete Prompt Response

bull Relevance to Prompt

5 Types of Thesis Clarity Errors

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

Feature types ordered from most important (left) to least important (right) for each metric:

S1: TC keywords, unlabeled LDA topics, RI, predicted TC errors, N-grams, annotated LDA topics, PA keywords
S2: N-grams, annotated LDA topics, TC keywords, RI, PA keywords, unlabeled LDA topics, predicted TC errors
S3: N-grams, annotated LDA topics, RI, unlabeled LDA topics, PA keywords, predicted TC errors, TC keywords
PC: annotated LDA topics, TC keywords, N-grams, RI, predicted TC errors, unlabeled LDA topics, PA keywords

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But… there are feature types that tend to be more important than the others in the presence of other features

  – most important: TC keywords, n-grams, annotated LDA topics

  – of middling importance: RI, unlabeled LDA topics

  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays

  – feature-rich approach

• Released the annotations

  – prompt adherence scores

  – prompt adherence keywords

  – manually annotated LDA topics

Page 76: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

76

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 77: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

77

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 78: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

78

Features based on Error Types

bull Introduced features for prompt adherence scoring

that encode the error types an essay contains

ndash Though each essay was manually annotated with the

errors it contains in a realistic setting we wonrsquot have

access to these manual annotations

bull Predict which of the 5 error types an essay contains

ndash Recast as a multi-label classification task

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 79: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

79

Creating Predicted ldquoError Typerdquo Features

bull Add a binary feature indicating the presence or

absence of each error type

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

[Chart: for each scoring metric (S1, S2, S3, PC), the seven feature types — TC keywords, PA keywords, N-grams, unlabeled LDA topics, annotated LDA topics, RI, and predicted TC errors — ranked from most important to least important; the ranking differs from metric to metric.]

• The relative importance of the features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
  – most important: TC keywords, n-grams, annotated LDA topics
  – middling importance: RI, unlabeled LDA topics
  – least important: predicted TC errors, PA keywords

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 80: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

80

6 N-gram features

bull Can capture useful words and phrases related to a

prompt

bull 10K lemmatized unigrams bigrams and trigrams

ndash selected according to information gain

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 81: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

81

Summary of Features

bull Baseline features

ndash Random indexing features

bull Six types of features

ndash Lemmatized n-grams

ndash Thesis clarity keyword features

ndash Prompt adherence keyword features

ndash LDA topics

ndash Manually annotated LDA topics

ndash Predicted thesis clarity errors

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 82: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

82

Corpus and Annotations

Model for scoring prompt adherence

Evaluation

Plan for the Talk

82

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 83: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

83

Evaluation

bull Goal evaluate our system for prompt

adherence scoring

bull 5-fold cross validation

84

Scoring Metrics

bull Define 4 evaluation metrics

84

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 84: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

84

Scoring Metrics

• Define 4 evaluation metrics over the annotated scores Ai and the estimated scores Ei:

  – S1 (probability of error): the probability that a system predicts the wrong score out of the 7 possible scores (1, 1.5, 2, 2.5, 3, 3.5, 4)

  – S2 (average absolute error): distinguishes near misses from far misses

  – S3 (average squared error): prefers systems whose estimations are not too often far away from the correct scores

  – PC: Pearson's correlation coefficient between Ai and Ei

• S1, S2, S3 are error metrics: a smaller value is better

• PC is a correlation coefficient: a larger value is better
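
Written out, one formalization consistent with the glosses above (Ai = annotated score, Ei = estimated score, N = number of essays; rounding Ei to the nearest of the seven valid scores for S1 is an assumption about how the regressor's real-valued output is discretized):

```latex
\begin{align*}
S1 &= \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[A_i \neq \operatorname{round}(E_i)\right]
      && \text{(probability of error)}\\
S2 &= \frac{1}{N}\sum_{i=1}^{N}\lvert A_i - E_i\rvert
      && \text{(average absolute error)}\\
S3 &= \frac{1}{N}\sum_{i=1}^{N}\left(A_i - E_i\right)^2
      && \text{(average squared error)}\\
PC &= \frac{\sum_{i}(A_i-\bar{A})(E_i-\bar{E})}
           {\sqrt{\sum_{i}(A_i-\bar{A})^2}\,\sqrt{\sum_{i}(E_i-\bar{E})^2}}
      && \text{(Pearson's correlation)}
\end{align*}
```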

94

Regressor Training

• SVM regressors are trained to maximize performance w.r.t. each scoring metric by tuning the regularization parameter on held-out development data
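
The talk does not name the SVM toolkit or the candidate parameter values, so the following is a minimal sketch of this tuning loop, assuming scikit-learn's SVR and a hypothetical grid over the regularization parameter C:

```python
from sklearn.svm import SVR

def tune_regularization(X_train, y_train, X_dev, y_dev, metric):
    """Keep the SVM regressor whose regularization parameter C gives the
    best held-out performance w.r.t. `metric`.  `metric(gold, pred)` must
    return a value where smaller is better; negate Pearson's correlation
    when tuning for PC."""
    best_model, best_err = None, float("inf")
    for C in (0.01, 0.1, 1.0, 10.0, 100.0):  # hypothetical grid
        model = SVR(kernel="linear", C=C).fit(X_train, y_train)
        err = metric(y_dev, model.predict(X_dev))
        if err < best_err:
            best_model, best_err = model, err
    return best_model
```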

95

Results

System                       S1     S2     S3     PC
Baseline                    .517   .368   .234   .233
+ Misspelling feature       .654   .515   .402
+ Keyword features          .663   .490   .369
+ Aggregated word n-grams   .651   .484   .374
+ Aggregated POS n-grams    .671   .483   .377
+ Aggregated frames         .672   .486   .382
Our system                  .488   .348   .197   .360

(Column callouts: S1 = probability of error, S2 = average absolute error, S3 = average squared error, PC = Pearson's correlation coefficient. Baseline = RI features.)

• Improvements w.r.t. all four scoring metrics

• Significant differences
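
A small self-contained helper (assuming NumPy and SciPy) that computes the four metrics from annotated and estimated scores; snapping each estimate to the nearest of the seven valid scores before computing S1 follows the gloss given with the metric definitions:

```python
import numpy as np
from scipy.stats import pearsonr

VALID_SCORES = np.arange(1.0, 4.5, 0.5)  # the 7 scores: 1, 1.5, ..., 4

def scoring_metrics(gold, pred):
    """Return (S1, S2, S3, PC) for annotated scores `gold` and estimated
    scores `pred`, both 1-D arrays of equal length."""
    gold = np.asarray(gold, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Snap each real-valued estimate to the nearest valid score for S1.
    rounded = VALID_SCORES[np.abs(pred[:, None] - VALID_SCORES).argmin(axis=1)]
    s1 = float(np.mean(rounded != gold))       # probability of error
    s2 = float(np.mean(np.abs(gold - pred)))   # average absolute error
    s3 = float(np.mean((gold - pred) ** 2))    # average squared error
    pc = float(pearsonr(gold, pred)[0])        # Pearson's correlation
    return s1, s2, s3, pc
```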

104

Feature Ablation

• Goal: examine how much impact each of the feature types has on our system's performance w.r.t. each scoring metric

  – Train a regressor on all but one type of features (see the sketch below)
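
A sketch of that ablation loop, reusing the hypothetical tune_regularization helper above; the feature-block dictionary layout and the printed deltas are illustrative assumptions, not the authors' code:

```python
import numpy as np

def ablation_study(feature_blocks, y_train, y_dev, metric):
    """`feature_blocks` maps a feature-type name to a pair
    (train matrix, dev matrix).  Each feature type is ablated in turn:
    train on all *other* types and measure the held-out metric; the
    bigger the degradation relative to the full model, the more
    important the ablated feature type."""
    def stacked(excluded=None):
        names = [n for n in feature_blocks if n != excluded]
        train = np.hstack([feature_blocks[n][0] for n in names])
        dev = np.hstack([feature_blocks[n][1] for n in names])
        return train, dev

    X_train, X_dev = stacked()
    full = tune_regularization(X_train, y_train, X_dev, y_dev, metric)
    full_err = metric(y_dev, full.predict(X_dev))
    for name in feature_blocks:
        X_train, X_dev = stacked(excluded=name)
        model = tune_regularization(X_train, y_train, X_dev, y_dev, metric)
        err = metric(y_dev, model.predict(X_dev))
        print(f"without {name}: {err:.3f} (delta {err - full_err:+.3f})")
```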

105

Feature Ablation Results

[Figure, slides 105–118: bar charts ranking the feature types from most important to least important under each scoring metric (S1, S2, S3, PC). Feature types shown: TC keywords, N-grams, annotated LDA topics, RI, unlabeled LDA topics, predicted TC errors, PA keywords; the ranking differs from metric to metric.]

• Relative importance of features does not always remain consistent if we measure performance in different ways

• But… there are features that tend to be more important than the others in the presence of other features:

  – most important: TC keywords, n-grams, annotated LDA topics

  – middling importance: RI, unlabeled LDA topics

  – least important: predicted TC errors, PA keywords

119

Summary

• Examined the problem of prompt adherence scoring in student essays

  – feature-rich approach

• Released the annotations

  – prompt adherence scores

  – prompt adherence keywords

  – manually annotated LDA topics

Page 85: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

85

bull Define 4 evaluation metrics

85

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)

Scoring Metrics

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 86: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

86

bull Define 4 evaluation metrics

86

probability that a system

predicts the wrong score out

of 7 possible scores (1 15 2

25 3 35 4)annotated scores

estimated scores

Scoring Metrics

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 87: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

87

Scoring Metrics

bull Define 4 evaluation metrics

87

(probability of error)

average absolute error

annotated scores

estimated scores

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 88: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

88

Scoring Metrics

bull Define 4 evaluation metrics

88

(probability of error)

distinguishes near misses from

far misses

annotated scores

estimated scores

89

Scoring Metrics

bull Define 4 evaluation metrics

89

(probability of error)

(average absolute error)

average squared error

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI


116

• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI, unlabeled LDA topics


117

• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI, unlabeled LDA topics
• least important: predicted TC errors


118

• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI, unlabeled LDA topics
• least important: predicted TC errors, PA keywords
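
The rankings above come from the ablation protocol on the Feature Ablation slide: train a regressor on all but one type of features, then score it under each metric. A rough sketch of that loop follows, under stated assumptions: scikit-learn's SVR stands in for the talk's SVM regressor (no implementation is named), the features argument is a hypothetical dict mapping each feature type to a pair of dense (train, dev) matrices, score_metrics is the helper sketched earlier, and the per-metric search over the regularization parameter C follows the Regressor Training slide's tuning on held-out development data.

import numpy as np
from sklearn.svm import SVR  # assumption: stand-in for the talk's SVM regressor

FEATURE_TYPES = ["PA keywords", "TC keywords", "annotated LDA topics",
                 "unlabeled LDA topics", "N-grams", "RI", "predicted TC errors"]

def ablate(features, y_train, y_dev, Cs=(0.01, 0.1, 1.0, 10.0)):
    # features: {feature type: (X_train_block, X_dev_block)} -- hypothetical layout.
    results = {}
    for held_out in FEATURE_TYPES:
        kept = [t for t in FEATURE_TYPES if t != held_out]
        X_train = np.hstack([features[t][0] for t in kept])
        X_dev = np.hstack([features[t][1] for t in kept])
        # Tune C separately for each metric on development data, keeping
        # the best value per metric (S1-S3 are minimized, PC is maximized).
        best = {"S1": np.inf, "S2": np.inf, "S3": np.inf, "PC": -np.inf}
        for C in Cs:
            E = SVR(kernel="linear", C=C).fit(X_train, y_train).predict(X_dev)
            S1, S2, S3, PC = score_metrics(y_dev, E)
            best["S1"], best["S2"] = min(best["S1"], S1), min(best["S2"], S2)
            best["S3"], best["PC"] = min(best["S3"], S3), max(best["PC"], PC)
        results[held_out] = best
    # The more a metric degrades relative to the full feature set, the more
    # important the held-out feature type is with respect to that metric.
    return results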


119

Summary
• Examined the problem of prompt adherence scoring in student essays
– feature-rich approach
• Released the annotations
– prompt adherence scores
– prompt adherence keywords
– manually annotated LDA topics


Page 90: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

90

Scoring Metrics

bull Define 4 evaluation metrics

90

(probability of error)

(average absolute error)

prefer systems whose

estimations are not too often

far away from correct scores

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 91: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

91

Scoring Metrics

bull Define 4 evaluation metrics

91

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 92: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

92

Scoring Metrics

bull Define 4 evaluation metrics

92

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 93: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

93

Scoring Metrics

bull Define 4 evaluation metrics

93

(probability of error)

(average absolute error)

(average squared error)

PC Pearsonrsquos correlation coefficient between Ai and Ei

S1 S2 S3

bull error metrics

bull smaller value is better

PC

bull correlation coefficient

bull larger value is better

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features


118

• most important: TC keywords, n-grams, annotated LDA topics
• middling importance: RI, unlabeled LDA topics
• least important: predicted TC errors, PA keywords
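These rankings come from the ablation setup (train a regressor on all but one type of features, then compare against the full system). A hedged sketch of that loop, reusing score_metrics from above; the feature-group dictionaries and scikit-learn's SVR are illustrative assumptions, and the fixed C stands in for the per-metric tuning of the regularization parameter on held-out development data:

import numpy as np
from sklearn.svm import SVR

FEATURE_TYPES = ["PA keywords", "TC keywords", "N-grams", "RI",
                 "annotated LDA topics", "unlabeled LDA topics",
                 "predicted TC errors"]

def ablate(train_groups, y_train, dev_groups, y_dev, C=1.0):
    # train_groups / dev_groups: hypothetical dicts mapping a feature-type
    # name to its (n_essays x n_features) matrix.
    results = {}
    for held_out in [None] + FEATURE_TYPES:
        keep = [t for t in FEATURE_TYPES if t != held_out]
        X_tr = np.hstack([train_groups[t] for t in keep])
        X_dv = np.hstack([dev_groups[t] for t in keep])
        pred = SVR(C=C).fit(X_tr, y_train).predict(X_dv)
        results[held_out or "full system"] = score_metrics(y_dev, pred)
    # A feature type matters more the more S1/S2/S3 rise (and PC falls)
    # relative to the full system when that type is held out.
    return results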


119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 94: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

94

Regressor Training

bull SVM regressors are trained to maximize performance

wrt each scoring metric by tuning the regularization

parameter on held-out development data

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 95: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

95

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

95

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 96: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

96

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

96

RI features

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 97: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

97

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

97

probability of error

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 98: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

98

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

98

probability of error

average absolute

error

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 99: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

99

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

99

probability of error

average absolute

error

average squared

error

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110–118

(the ablation figure repeats on each of these slides)

• Relative importance of features does not always remain consistent if we measure performance in different ways
• But… there are features that tend to be more important than the others in the presence of other features:
• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI, unlabeled LDA topics
• least important: predicted TC errors, PA keywords
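The per-metric rankings shown in the figure can be derived from the ablation numbers: the more a metric degrades when one feature type is held out, the more important that type is. A minimal sketch under the same assumptions as above (rank_feature_types is a hypothetical helper; for the error metrics S1, S2, S3 lower is better, while for PC higher is better).

def rank_feature_types(full_scores, ablation_scores, metric, higher_is_better=False):
    # full_scores: metrics of the full system, e.g. {"S1": 0.488, ...}
    # ablation_scores: output of ablation(), {held_out_type: {metric: value}}
    sign = -1.0 if higher_is_better else 1.0
    deltas = {t: sign * (scores[metric] - full_scores[metric])
              for t, scores in ablation_scores.items()}
    # largest degradation first = most important feature type
    return sorted(deltas, key=deltas.get, reverse=True)

# Example: rank by Pearson correlation (PC), where higher is better
# ranking = rank_feature_types(full, ablated, "PC", higher_is_better=True)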

119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics

Page 100: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

100

Results

233

PCSystem S1 S2 S3

Baseline 517 368 234

+ Misspelling

feature

654 515 402

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

100

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 101: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

101

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

101

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 102: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

102

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

102

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

bull Improvements wrt all four scoring metrics

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 103: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

103

bull Improvements wrt all four scoring metricsbull Improvements wrt all four scoring metrics

Results

360

233

PCSystem S1 S2 S3

Baseline 517 368 234

Our system 488 348 197

+ Keyword

features

663 490 369

+ Aggregated

word n-grams

651 484 374

+ Aggregated

POS n-grams

671 483 377

+ Aggregated

frames

672 486 382

103

probability of error

average absolute

error

average squared

error

Pearsonrsquos

τ

Significant differences

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 104: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

104

Feature Ablation

bull Goal examine how much impact each of the feature

types has on our systemrsquos performance wrt each

scoring metric

ndash Train a regressor on all but one type of features

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 105: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

105

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

114

bull most important TC keywords n-grams annotated LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

115

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

116

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

117

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

118

bull most important TC keywords n-grams annotated LDA topics

bull middling important RI unlabeled LDA topics

bull least important predicted TC errors PA keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

119

Summary

bull Examined the problem of prompt adherence

scoring in student essays

ndash feature-rich approach

bull Released the annotations

ndash prompt adherence scores

ndash prompt adherence keywords

ndash manually annotated LDA topics

Page 106: Modeling Prompt Adherence in Student Essays - HLTRIvince/talks/acl14-essay.pdf · Modeling Prompt Adherence in Student Essays Isaac Persing and Vincent Ng Human Language Technology

106

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-gramsS2

107

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

S2

S3

108

Feature Ablation Results

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

109

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

110

bull Relative importance of features does not always remain

consistent if we measure performance in different ways

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

111

bull Buthellip there are feature that tend to be more important than

the others in the presence of other features

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

112

bull most important TC keywords

PA

keywords

Annotated

LDA topicsN-grams

Predicted

TC errorsRI

Unlabeled

LDA topics

TC

keywordsS1

Predicted

TC errors

Unlabeled

LDA topics

PA

keywordsRI

TC

keywords

Annotated

LDA topicsN-grams

TC

keywords

Predicted

TC errors

PA

keywords

Unlabeled

LDA topicsRI

Annotated

LDA topicsN-grams

PA

keywords

Unlabeled

LDA topics

Predicted

TC errorsRIN-grams

TC

keywords

Annotated

LDA topics

S2

S3

PC

Most important

Least important

113

bull most important TC keywords n-grams


114

• most important: TC keywords, n-grams, annotated LDA topics
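For the LDA-topic features, the natural feature vector is each essay's topic distribution; "annotated" topics additionally carry a manual judgment of how related each topic is to the prompt. A minimal gensim sketch; the toy corpus and topic count are placeholders, not the talk's setup:

from gensim import corpora, models

essays = [["smoking", "ban", "restaurants", "health"],
          ["university", "degrees", "theoretical", "knowledge"],
          ["smoking", "public", "places", "ban"]]

dictionary = corpora.Dictionary(essays)
bows = [dictionary.doc2bow(e) for e in essays]

# Unlabeled LDA topics: induce topics over the essays and use each
# essay's topic distribution directly as features.
lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
features = [lda.get_document_topics(bow, minimum_probability=0.0)
            for bow in bows]
print(features[0])  # [(topic_id, probability), ...] for the first essay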


115

• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI


116

• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI, unlabeled LDA topics


117

• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI, unlabeled LDA topics
• least important: predicted TC errors


118

• most important: TC keywords, n-grams, annotated LDA topics
• middling important: RI, unlabeled LDA topics
• least important: predicted TC errors, PA keywords


119

Summary

• Examined the problem of prompt adherence scoring in student essays
  – feature-rich approach
• Released the annotations
  – prompt adherence scores
  – prompt adherence keywords
  – manually annotated LDA topics
