TRANSCRIPT
+ Evaluation of Text Generation: Automatic Evaluation vs. Variation
Amanda Stent, Mohit Singhai, Matthew Marge
Columbia University
+ Natural Language Generation
- Content Planning
- Text Planning
- Sentence Planning
- Surface Realization
+ Approaches to Surface Realization
- Template-based
  - Domain-specific
  - All output tends to be high quality because it is highly constrained
- Grammar-based
  - Typically one high-quality output per input
- Forest-based
  - Many outputs per input
- Text-to-text
  - No need for other generation components
+ Surface Realization Tasks
- To communicate the input meaning as completely, clearly, and elegantly as possible through careful:
  - Word selection
  - Word and phrase arrangement
  - Consideration of context
Importance of Lexical Choice
I drove to Rochester.
I raced to Rochester.
I went to Rochester.
+ Importance of Syntactic Structure
- I picked up my coat three weeks later from the dry cleaners in Smithtown
- In Smithtown I picked up my coat from the dry cleaners three weeks later
+ Evaluating Text Generators
- Per-generator: Coverage
- Per-sentence:
  - Adequacy
  - Fluency / syntactic accuracy
  - Informativeness
- Additional metrics of interest:
  - Range: ability to produce valid variants
  - Readability
  - Task-specific metrics (e.g. for dialog)
+ Evaluating Text Generators
- Human judgments
- Parsing + interpretation
- Automatic evaluation metrics, for generation or machine translation:
  - Simple string accuracy+
  - NIST*
  - BLEU*+
  - F-measure*
  - LSA
+ Question
What is a “good” sentence?
- Readable
- Fluent
- Adequate
Approach
- Question: which evaluation metric, or set of evaluation metrics, least punishes variation?
  - Word choice variation
  - Syntactic structure variation
- Procedure: measure correlation between human and automatic judgments of variations
  - Context not included
+ Lexical and Syntactic variation
(a) I bought tickets for the show on Tuesday.
(b) It was the show on Tuesday for which I bought tickets.
(c) I got tickets for the show on Tuesday.
(d) I bought tickets for the Tuesday show.
(e) On Tuesday I bought tickets for the show.
(f) For the show on Tuesday tickets I bought.
+ String Accuracy
- Simple string accuracy: (I + D + S) / #Words (Callaway 2003; Langkilde 2002; Rambow et al. 2002; Leusch et al. 2003)
- Generation string accuracy: (M + I' + D' + S) / #Words (Rambow et al. 2002)
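The simple string accuracy formula can be sketched in a few lines: compute the word-level edit distance (insertions + deletions + substitutions) against the reference and normalize by reference length. This is a rough sketch, reported as 1 minus the error rate so that identical sentences score 1.0; the function name is illustrative, not from the slides.

```python
def simple_string_accuracy(candidate, reference):
    """1 - (insertions + deletions + substitutions) / #reference words,
    with the I+D+S count obtained as a word-level Levenshtein distance."""
    c, r = candidate.split(), reference.split()
    prev = list(range(len(c) + 1))
    for i, rw in enumerate(r, 1):
        curr = [i]
        for j, cw in enumerate(c, 1):
            cost = 0 if rw == cw else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 1.0 - prev[-1] / len(r)
```

On the slide's earlier example, "I got tickets" vs. reference "I bought tickets" costs one substitution out of three words, so the score is 2/3.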
+ BLEU
- Developed by Papineni et al. at IBM
- Key idea: count matching subsequences between the reference and candidate sentences
- Avoid counting matches multiple times by clipping
- Punish differences in sentence length
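These steps (matching n-grams, clipping, brevity penalty) can be sketched as follows, assuming plain whitespace tokenization and a small floor in place of proper smoothing; the official BLEU implementation differs in such details.

```python
from collections import Counter
import math

def bleu(candidate, references, max_n=4):
    """Sketch of BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        max_ref = Counter()
        for ref in refs:
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            for g, c in ref_ngrams.items():
                max_ref[g] = max(max_ref[g], c)
        # Clipping: a candidate n-gram counts at most as often as in any one reference.
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    # Brevity penalty punishes candidates shorter than the closest reference length.
    ref_len = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(cand)), rl))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```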
+ NIST n-gram
- Designed to fix two problems with BLEU:
  - The geometric mean penalizes large N
  - We might prefer n-grams that are more informative, i.e. less likely
- Takes an arithmetic average over all n-gram co-occurrences
- Weights "less likely" n-grams more
- Uses a brevity factor to punish varying sentence lengths
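The "weight less likely n-grams more" step can be illustrated by NIST's information weight: log2(count of an n-gram's (n-1)-gram prefix / count of the full n-gram) over a reference corpus. A minimal sketch of just the weighting (the full metric then averages these weights over co-occurring n-grams, per n, and applies the brevity factor):

```python
from collections import Counter
import math

def nist_info(reference_corpus, max_n=5):
    """Return an info(gram) function giving the NIST-style information weight
    of a word n-gram tuple; rarer n-grams get higher weight."""
    counts = Counter()
    total_words = 0
    for sent in reference_corpus:
        toks = sent.split()
        total_words += len(toks)
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1

    def info(gram):
        if counts[gram] == 0:
            return 0.0  # unseen n-grams contribute nothing
        prefix_count = counts[gram[:-1]] if len(gram) > 1 else total_words
        return math.log2(prefix_count / counts[gram])

    return info
```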
+ F Measure
- Idea due to Melamed 1995 and Turian et al. 2003
- Same basic idea as the n-gram measures
- Designed to eliminate the "double counting" done by n-gram measures
- F = 2 * precision * recall / (precision + recall)
- Precision(candidate | reference) = maximum matching size(candidate, reference) / |candidate|
- Recall(candidate | reference) = maximum matching size(candidate, reference) / |reference|
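A sketch of the formulas above, approximating "maximum matching size" by bag-of-words overlap; the actual GTM metric of Turian et al. computes a maximum matching that rewards longer contiguous runs, which this sketch omits.

```python
from collections import Counter

def gtm_f_measure(candidate, reference):
    """F = harmonic mean of precision (match / |candidate|) and recall
    (match / |reference|); match size approximated by multiset word overlap."""
    c, r = candidate.split(), reference.split()
    match = sum((Counter(c) & Counter(r)).values())
    if match == 0:
        return 0.0
    precision = match / len(c)
    recall = match / len(r)
    return 2 * precision * recall / (precision + recall)
```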
+ F Measure
+ LSA
- Doesn't care about word order
- Evaluates how similar two bags of words are with respect to a corpus
  - Measures "similarity" with co-occurrence vectors
- A good way of evaluating word choice?
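A rough sketch of comparing two bags of words via corpus co-occurrence vectors. Real LSA additionally factors the co-occurrence matrix with SVD, which this sketch skips; the function names and window size are illustrative.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(corpus, window=2):
    """Build a co-occurrence vector for each word from a corpus of sentences."""
    vecs = defaultdict(Counter)
    for sent in corpus:
        toks = sent.split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    vecs[w][toks[j]] += 1
    return vecs

def sentence_similarity(s1, s2, vecs):
    """Cosine of the summed co-occurrence vectors of two bags of words;
    word order is ignored entirely."""
    def bag_vec(s):
        v = Counter()
        for w in s.split():
            v.update(vecs.get(w, {}))
        return v
    a, b = bag_vec(s1), bag_vec(s2)
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Because only the summed vectors are compared, any reordering of the same words scores identically, which is exactly why LSA does not punish word order variation.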
| Eval. metric | Means of measuring fluency | Means of measuring adequacy | Means of measuring readability | Punishes length differences? |
|---|---|---|---|---|
| SSA | Comparison against reference sentence | Comparison against reference sentence | Comparison against reference sentence from same context* | Yes (punishes deletions, insertions) |
| NIST n-gram, BLEU | Comparison against reference sentences (matching n-grams) | Comparison against reference sentences | Comparison against reference sentences from same context* | Yes (weights) |
| F measure | Comparison against reference sentences (longest matching substrings) | Comparison against reference sentences | Comparison against reference sentences from same context* | Yes (length factor) |
| LSA | None | Comparison against word co-occurrence frequencies learned from corpus | None | Not explicitly |
+ Experiment 1
- Sentence data from Barzilay and Lee's paraphrase generation system (Barzilay and Lee 2002)
- Includes word choice variation, e.g.:
  - Another person was also seriously wounded in the attack vs.
  - Another individual was also seriously wounded in the attack
- Includes word order variation, e.g.:
  - A suicide bomber blew himself up at a bus stop east of Tel Aviv on Thursday, killing himself and wounding five bystanders, one of them seriously, police and paramedics said
  - A suicide bomber killed himself and wounded five, when he blew himself up at a bus stop east of Tel Aviv on Thursday
+ Paraphrase Generation
1. Cluster like sentences
   - By hand or using word n-gram co-occurrence statistics
   - May first remove certain details
2. Compute a multiple-sequence alignment
   - Finds choice points and regularities in input sentence pairs/sets in a corpus
3. Match lattices
   - Match between corpora
4. Generate
   - Via lattice alignment
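Step 2 can be illustrated for the pairwise case: align two similar sentences word by word and turn the divergences into choice points of a lattice. This is a hedged sketch using Levenshtein alignment over two sentences; Barzilay and Lee's system computes a full multiple-sequence alignment over clusters of sentences.

```python
def align_lattice(s1, s2):
    """Align two token sequences; aligned positions that differ become
    {x|y} choice points, insertions/deletions become optional slots."""
    a, b = s1.split(), s2.split()
    n, m = len(a), len(b)
    # Word-level edit-distance DP table.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # Backtrace: shared words stay; divergences become choice points.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1):
            out.append(a[i - 1] if a[i - 1] == b[j - 1] else "{%s|%s}" % (a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out.append("{%s|}" % a[i - 1])  # word optional in one variant
            i -= 1
        else:
            out.append("{|%s}" % b[j - 1])
            j -= 1
    return " ".join(reversed(out))
```

For the word-choice pair above, the aligned lattice keeps the common words and exposes the substitution as a choice point, e.g. "I {bought|got} tickets".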
+ Paraphrase Generation Issues
- Sometimes words chosen for substitution carry unwanted connotations
- Sometimes extra words are included (or words removed) that change the meaning
+ Discussion
- These metrics achieve some level of correlation with human judgments of adequacy
  - But could we do better?
- Most metrics are negatively correlated with fluency
  - Word order or constituent order variation requires a different metric
- Automatic evaluation metrics other than LSA punish word choice variation
- Automatic evaluation metrics other than LSA punish word order variation
  - They are not as affected by word order variation as by word choice variation
  - They punish legitimate and illegitimate word and constituent reorderings equally
+ Discussion
- Fluency: these metrics are not adequate for evaluating fluency in the presence of variation
- Adequacy: these metrics are barely adequate for evaluating adequacy in the presence of variation
- Readability: these metrics do not claim to evaluate readability
+ A Preliminary Proposal
- Modify automatic evaluation metrics so that they:
  - Do not punish legitimate word choice variation
    - E.g. using WordNet
    - But the 'simple' approach doesn't work
  - Do not punish legitimate word order variation
    - But this requires a notion of constituency
+ Another Preliminary Proposal
- When using metrics that depend on a reference sentence, use:
  - A set of reference sentences
    - Try to include as many of the word choice and word order variations as possible
  - Reference sentences from the same context as the candidate sentence
    - To approach an evaluation of readability
- Combine with some other metric for fluency
  - For example, a grammar checker
+ A Proposal
- To evaluate a generator:
  - Evaluate for coverage using recall or a related metric
  - Evaluate for 'precision' using separate metrics for fluency, adequacy, and readability
    - At this point, only fluency may be evaluable automatically, using a grammar checker
    - Adequacy can be approached using LSA or a related metric
    - Readability can only be evaluated using human judgments at this time
Current and Future Work
- Other potential evaluation metrics:
  - F measure plus WordNet
  - Parsing as a measure of fluency
  - F measure plus LSA
  - Multiple-sequence alignment as an evaluation metric
- Metrics that evaluate readability