Using Artificial Intelligence to Support Peer Review of Writing

Diane Litman

Department of Computer Science, Intelligent Systems Program, & Learning Research and Development Center

Context

Speech and Language Processing for Education

Learning Language (reading, writing, speaking)

Using Language (to teach everything else)

Tutors

Scoring

Readability

Processing Language

Tutorial Dialogue Systems / Peers

CSCL

Discourse Coding

Lecture Retrieval

Questioning & Answering

Peer Review

Outline

SWoRD

Improving Review Quality

Identifying Helpful Reviews

Summary and Current Directions

SWoRD [Cho & Schunn, 2007]

1. Authors submit papers

2. Reviewers submit (anonymous) feedback

3. Authors revise and resubmit papers

4. Authors provide back-ratings to reviewers regarding feedback helpfulness

Some Weaknesses

1. Feedback is often not stated in effective ways

2. Feedback and papers often do not focus on core aspects

Our Approach: Detect and Scaffold

1. Detect and direct reviewer attention to key feedback features such as solutions

2. Detect and direct reviewer and author attention to thesis statements in papers and feedback

Improving Learning from Peer Review with NLP and ITS Techniques (with Ashley, Schunn), LRDC internal grant

Feedback Features and Positive Writing Performance [Nelson & Schunn, 2008]

Solutions

Summarization

Localization

Understanding of the Problem

Implementation

I. Detecting Key Feedback Features

Natural Language Processing (NLP) to extract attributes from text, e.g.
– Regular expressions (e.g. “the section about”)
– Domain lexicons (e.g. “federal”, “American”)
– Syntax (e.g. demonstrative determiners)
– Overlapping lexical windows (quotation identification)
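The four attribute types listed above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the regular expression, the two-word domain lexicon (taken from the slide's examples), the word-list approximation of demonstrative determiners (a real system would use a tagger or parser), the window size of 3, and the function names. It is not the extractor used in the cited work.

```python
import re

# Hypothetical pattern, lexicon, and word list, chosen only for illustration.
LOCATION_PATTERN = re.compile(
    r"\bthe (section|paragraph|part|sentence) (about|on|where)\b", re.IGNORECASE
)
DOMAIN_LEXICON = {"federal", "american"}
# Word-list stand-in for a syntactic feature; it cannot tell a determiner
# ("this claim") from a complementizer ("claim that ...").
DEMONSTRATIVES = {"this", "that", "these", "those"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def window_overlap(feedback_tokens, paper_tokens, size=3):
    """Count feedback word windows of `size` tokens that also occur in the paper."""
    paper_windows = {
        tuple(paper_tokens[i:i + size])
        for i in range(len(paper_tokens) - size + 1)
    }
    return sum(
        tuple(feedback_tokens[i:i + size]) in paper_windows
        for i in range(len(feedback_tokens) - size + 1)
    )


def extract_features(feedback, paper):
    """Return one value per attribute type for a single feedback comment."""
    fb = tokenize(feedback)
    return {
        "regex_match": bool(LOCATION_PATTERN.search(feedback)),
        "domain_words": sum(t in DOMAIN_LEXICON for t in fb),
        "demonstratives": sum(t in DEMONSTRATIVES for t in fb),
        "overlap_windows": window_overlap(fb, tokenize(paper)),
    }


feedback = ("In the section about federal troops, this claim that "
            "the government refused to act needs a source.")
paper = "Many argued that the government refused to act during the crisis."
features = extract_features(feedback, paper)
```

Feature dictionaries of this kind would then serve as input to a learned classifier, which is the ML step described next.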

Machine Learning (ML) to predict whether feedback contains localization and solutions, and whether papers contain a thesis statement

Learned Localization Model [Xiong, Litman & Schunn, 2010]

Quantitative Model Evaluation

Feedback Feature  Classroom Corpus  N     Baseline Accuracy  Model Accuracy  Model Kappa  Human Kappa
Localization      History           875   53%                78%             .55          .69
Localization      Psychology        3111  75%                85%             .58          .63
Solution          History           1405  61%                79%             .55          .79
Solution          CogSci            5831  67%                85%             .65          .86
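For readers unfamiliar with the columns, the sketch below shows how accuracy and Cohen's kappa are conventionally computed for binary labels. Kappa is (observed agreement - chance agreement) / (1 - chance agreement). Treating the baseline as a majority-class predictor is an assumption; the slides do not say how the baseline was defined. The labels are made-up toy data, not values from the corpora.

```python
from collections import Counter


def accuracy(predicted, gold):
    """Fraction of items where the prediction equals the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def majority_baseline_accuracy(gold):
    """Accuracy of always predicting the most frequent gold label (assumed baseline)."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both sequences use a single identical label throughout.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(
        count_a[label] * count_b[label] for label in set(a) | set(b)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy data: 1 = feedback contains the feature, 0 = it does not.
gold = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]

model_accuracy = accuracy(pred, gold)
baseline_accuracy = majority_baseline_accuracy(gold)
model_kappa = cohen_kappa(pred, gold)
```

The same `cohen_kappa` function applies to two human annotators' labels, which is how a human-agreement figure like the last column is usually obtained.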

II. Predicting Feedback Helpfulness

Can expert helpfulness ratings be predicted from text? [Xiong & Litman, 2011a]

Impact of predicting student versus expert helpfulness ratings [Xiong & Litman, 2011b]

Results: Predicting Expert Ratings (average of writing and domain experts)

Techniques used in ranking product review helpfulness can be effectively adapted to peer reviews (R = .6)
– Structural attributes (e.g. review length, number of questions)
– Lexical statistics
– Meta-data (e.g. paper ratings)
– However, the relative utility of such features varies

Peer-review features improve performance (R = .7)
– Theory-motivated (e.g. localization)
– Abstraction (e.g. lexical categories) better for small corpora

Changing the meaning of “helpfulness”

Helpfulness may be perceived differently by different types of people

Average of two experts (prior experiment)

Writing expert

Content expert

Student peers

Content versus Writing Experts

Argumentation issue
– Writing-expert rating = 2
– Content-expert rating = 5

Your over all arguements were organized in some order but was unclear due to the lack of thesis in the paper. Inside each arguement, there was no order to the ideas presented, they went back and forth between ideas. There was good support to the arguements but yet some of it didnt not fit your arguement.

Transition issue
– Writing-expert rating = 5
– Content-expert rating = 2

First off, it seems that you have difficulty writing transitions between paragraphs. It seems that you end your paragraphs with the main idea of each paragraph. That being said, … (omit 173 words) As a final comment, try to continually move your paper, that is, have in your mind a logical flow with every paragraph having a purpose.

Results: Other Helpfulness Ratings

Generic features are more predictive for student ratings
– Lexical features: transition cues, negation, suggestion words
– Meta features: paper rating

Theory-supported features are more useful for experts
– Both experts: solution
– Writing expert: praise
– Content expert: critiques, localization

Summary

Artificial Intelligence (NLP and ML) can be used to automatically detect desirable feedback features
– localization, solution
– feedback and reviewer levels

Techniques used to predict product review helpfulness can be effectively adapted to peer review
– Knowledge of peer reviews increases performance
– Helpfulness type influences feature utility

Current and Future Work

Extrinsic evaluation in SWoRD
– Intelligent Scaffolding for Peer Reviews of Writing (with Ashley, Godley, Schunn), IES (recommended for funding)

Extend to reviews of argument diagrams
– Teaching Writing and Argumentation with AI-Supported Diagramming and Peer Review (with Ashley, Schunn), NSF

Teacher dashboard
– Keeping Instructors Well-informed in Computer-Supported Peer Review (with Ashley, Schunn, Wang), LRDC internal grant

Thank you!

Questions?

Peer versus Product Reviews

– Helpfulness is directly rated on a scale (rather than a function of binary votes)
– Peer reviews frequently refer to the related papers
– Helpfulness has a writing-specific semantics
– Classroom corpora are typically small

Generic Linguistic Features

Type                 Label       Features (#)
Structural           STR         revLength, sentNum, question%, exclamationNum
Lexical              UGR, BGR    tf-idf statistics of review unigrams (# = 2992) and bigrams (# = 23209)
Syntactic            SYN         Noun%, Verb%, Adj/Adv%, 1stPVerb%, openClass%
Semantic (adapted)   TOP         counts of topic words (# = 288)
                     posW, negW  counts of positive (# = 1319) and negative sentiment words (# = 1752)
Meta-data (adapted)  META        paperRating, paperRatingDiff
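As an illustration of the structural (STR) row, the sketch below computes the four named features for one review. The feature names come from the slide, but their exact definitions are assumptions: review length is taken as a word count, sentences are split naively on terminal punctuation, and question% is the share of sentences ending in a question mark.

```python
import re


def structural_features(review):
    """Compute assumed versions of the STR features for one review text."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]
    words = re.findall(r"\w+(?:'\w+)?", review)
    questions = sum(s.endswith("?") for s in sentences)
    return {
        "revLength": len(words),
        "sentNum": len(sentences),
        "question%": questions / len(sentences) if sentences else 0.0,
        "exclamationNum": review.count("!"),
    }


review = ("Good start! Where is your thesis? "
          "Add one to the first paragraph. Why omit page 3?")
features = structural_features(review)
```

A production system would likely use a proper sentence splitter, since abbreviations and ellipses break the naive rule used here.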

Specialized Features

Type                Label  Features (#)
Cognitive Science   cogS   praise%, summary%, criticism%, plocalization%, solution%
Lexical Categories  LEX2   Counts of 10 categories of words
Localization        LOC    Features developed for identifying problem localization

Lexical Categories

Extracted from:

1. Coding Manuals

2. Decision trees trained with Bag-of-Words

Tag  Meaning        Word list
SUG  suggestion     should, must, might, could, need, needs, maybe, try, revision, want
LOC  location       page, paragraph, sentence
ERR  problem        error, mistakes, typo, problem, difficulties, conclusion
IDE  idea verb      consider, mention
LNK  transition     however, but
NEG  negative       fail, hard, difficult, bad, short, little, bit, poor, few, unclear, only, more
POS  positive       great, good, well, clearly, easily, effective, effectively, helpful, very
SUM  summarization  main, overall, also, how, job
NOT  negation       not, doesn't, don't
SOL  solution       revision, specify, correction
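The word lists above translate directly into count features. The following is a minimal sketch of how the ten category counts might be computed for a review. The word lists are copied from the table; the tokenizer, the function name, and the choice to count a word once for every category that lists it (for example, "revision" appears under both SUG and SOL) are assumptions.

```python
import re
from collections import Counter

# Word lists copied from the table; tags are the table's own labels.
LEXICAL_CATEGORIES = {
    "SUG": {"should", "must", "might", "could", "need", "needs",
            "maybe", "try", "revision", "want"},
    "LOC": {"page", "paragraph", "sentence"},
    "ERR": {"error", "mistakes", "typo", "problem", "difficulties", "conclusion"},
    "IDE": {"consider", "mention"},
    "LNK": {"however", "but"},
    "NEG": {"fail", "hard", "difficult", "bad", "short", "little",
            "bit", "poor", "few", "unclear", "only", "more"},
    "POS": {"great", "good", "well", "clearly", "easily",
            "effective", "effectively", "helpful", "very"},
    "SUM": {"main", "overall", "also", "how", "job"},
    "NOT": {"not", "doesn't", "don't"},
    "SOL": {"revision", "specify", "correction"},
}


def lexical_category_counts(review):
    """Return how many tokens of the review fall in each lexical category."""
    # Assumed tokenizer: lowercase words, keeping straight-apostrophe contractions.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", review.lower())
    counts = Counter()
    for token in tokens:
        for tag, words in LEXICAL_CATEGORIES.items():
            if token in words:
                counts[tag] += 1
    return {tag: counts[tag] for tag in LEXICAL_CATEGORIES}


review = ("The main argument is good, but you should specify the page. "
          "It doesn't need a revision.")
counts = lexical_category_counts(review)
```

Replacing thousands of unigram features with ten counts like these is one way to read the slides' point that abstraction helps on small corpora.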

Discussion

• Effectiveness of generic features across domains
• Same best generic feature combination (STR+UGR+MET)
• But…

Results: Specialized Features

• Introducing high-level features does enhance the model’s performance. Best model: Spearman correlation of 0.671 and Pearson correlation of 0.665.

Feature Type               r              rs
cogS                       0.425+/-0.094  0.461+/-0.072
LEX2                       0.512+/-0.013  0.495+/-0.102
LOC                        0.446+/-0.133  0.472+/-0.113
STR+MET+UGR (Baseline)     0.615+/-0.101  0.609+/-0.098
STR+MET+LEX2               0.621+/-0.096  0.611+/-0.088
STR+MET+LEX2+TOP           0.648+/-0.097  0.655+/-0.081
STR+MET+LEX2+TOP+cogS      0.660+/-0.093  0.655+/-0.081
STR+MET+LEX2+TOP+cogS+LOC  0.665+/-0.089  0.671+/-0.076
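The r and rs columns are Pearson and Spearman correlations between predicted and human helpfulness ratings. As a reminder of what each measures, here is a small self-contained sketch. The function names and the toy numbers are illustrative assumptions, and Spearman is computed as the Pearson correlation of average ranks, which is the standard definition when ties are present.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear association between two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: predictions that preserve the ordering but not the spacing.
human = [1, 2, 3, 4, 5]
predicted = [1, 4, 9, 16, 25]
r = pearson(human, predicted)
rs = spearman(human, predicted)
```

In this toy case rs is 1 while r is below 1, because the predictions rank the reviews perfectly but are not linearly related to the human scores. Reporting both, as the table does, separates ranking quality from score calibration.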

Students versus Experts

Praise
– Student rating = 7
– Expert-average rating = 2

The author also has great logic in this paper. How can we consider the United States a great democracy when everyone is not treated equal. All of the main points were indeed supported in this piece.

Critique
– Student rating = 3
– Expert-average rating = 5

I thought there were some good opportunities to provide further data to strengthen your argument. For example the statement “These methods of intimidation, and the lack of military force offered by the government to stop the KKK, led to the rescinding of African American democracy.” Maybe here include data about how … (omit 126 words)

Sample Result: All Features

• Feature selection of all features
• Students are more influenced by meta features, demonstrative determiners, number of sentences, and negation words
• Experts are more influenced by review length and critiques
• Content expert values solutions, domain words, problem localization
• Writing expert values praise and summary
