
Page 1: Re-evaluating Bleu

Re-evaluating Bleu

Alison Alvarez
Machine Translation Seminar

February 16, 2006

Page 2: Re-evaluating Bleu

Overview

• The Weaknesses of Bleu
  – Introduction
  – Precision and Recall
  – Fluency and Adequacy
  – Variations Allowed by Bleu
  – Bleu and Tides 2005

• An Improved Model
  – Overview of the Model
  – Experiment
  – Results

• Conclusions

Page 3: Re-evaluating Bleu

Introduction

• Bleu has been shown to have high correlations with human judgments

• Bleu has been used by MT researchers for five years, sometimes in place of manual human evaluations

• But does the minimization of the error rate accurately show improvements in translation quality?

Page 4: Re-evaluating Bleu

Precision and Bleu

• Of my answers, how many are right/wrong?

• Precision = |B ∩ C| / |C|, i.e. A / C

[Venn diagram: B = reference translation, C = hypothesis translation, A = B ∩ C, their overlap]

Page 5: Re-evaluating Bleu

Precision and Bleu

Bleu is a precision-based metric

• The modified precision score, p_n:

  p_n = ( ∑_{S ∈ corpus} ∑_{ngram ∈ S} Count_matched(ngram) ) / ( ∑_{S ∈ corpus} ∑_{ngram ∈ S} Count(ngram) )
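
A minimal Python sketch of this modified precision (the function names and toy sentences are illustrative, not from the paper): each hypothesis n-gram count is clipped to the largest count of that n-gram in any reference, matching Count_matched above.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypotheses, references_list, n):
    """Corpus-level modified n-gram precision p_n.

    hypotheses:      list of tokenized hypothesis sentences
    references_list: one list of tokenized reference sentences per hypothesis
    Each hypothesis n-gram count is clipped to the maximum count of that
    n-gram in any of its references (Count_matched in the formula above).
    """
    matched, total = 0, 0
    for hyp, refs in zip(hypotheses, references_list):
        hyp_counts = Counter(ngrams(hyp, n))
        max_ref_counts = Counter()
        for ref in refs:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        for gram, cnt in hyp_counts.items():
            matched += min(cnt, max_ref_counts[gram])
            total += cnt
    return matched / total if total else 0.0

# Toy usage: one segment, two references
hyps = ["the cat sat on the mat".split()]
refs = [["the cat is on the mat".split(), "there is a cat on the mat".split()]]
print(modified_precision(hyps, refs, 1))  # unigram precision
print(modified_precision(hyps, refs, 2))  # bigram precision
```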

Page 6: Re-evaluating Bleu

Recall and Bleu

• Of the potential answers, how many did I retrieve/miss?

• Recall = |B ∩ C| / |B|, i.e. A / B

[Venn diagram: B = reference translation, C = hypothesis translation, A = B ∩ C, their overlap]
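
To make the two diagrams concrete, here is a small illustrative sketch (helper names and sentences are mine, not from the slides) that computes clipped unigram precision and recall for a single hypothesis/reference pair.

```python
from collections import Counter

def overlap(hyp_tokens, ref_tokens):
    """Size of the clipped word overlap A between hypothesis (C) and reference (B)."""
    ref_counts = Counter(ref_tokens)
    return sum(min(cnt, ref_counts[w]) for w, cnt in Counter(hyp_tokens).items())

def precision_recall(hyp_tokens, ref_tokens):
    a = overlap(hyp_tokens, ref_tokens)
    precision = a / len(hyp_tokens)  # A / C: of my answers, how many are right?
    recall = a / len(ref_tokens)     # A / B: of the potential answers, how many did I retrieve?
    return precision, recall

hyp = "the the cat".split()
ref = "the cat sat on the mat".split()
print(precision_recall(hyp, ref))  # (1.0, 0.5)
```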

Page 7: Re-evaluating Bleu

Recall and Bleu

• Because Bleu uses multiple reference translations at once, recall cannot be calculated

Page 8: Re-evaluating Bleu

Fluency and Adequacy to Evaluators

• Fluency: “How do you judge the fluency of this translation?” Judged with no reference translation and to the standard of written English

• Adequacy: “How much of the meaning expressed in the reference is also expressed in the hypothesis translation?”

Page 9: Re-evaluating Bleu

Variations

• Bleu allows for variations in word and phrase order that lead to less fluent output

• No constraints are placed on the order in which matching n-grams appear

Page 10: Re-evaluating Bleu

Variations

Page 11: Re-evaluating Bleu

Variations

The two example translations on the previous slide (shown as an image, not reproduced in this transcript) receive the same bigram score.
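
As a hedged illustration of this point (the slide's own example is an image and is not reproduced here), the sketch below builds two hypotheses from the same three chunks in different orders; because the chunk-boundary bigrams fail to match the reference in either ordering, both receive the same bigram precision. The sentences are invented for the demonstration.

```python
from collections import Counter

def bigram_precision(hyp, ref):
    """Clipped bigram precision of one hypothesis against one reference."""
    hyp_bi = Counter(zip(hyp, hyp[1:]))
    ref_bi = Counter(zip(ref, ref[1:]))
    matched = sum(min(cnt, ref_bi[b]) for b, cnt in hyp_bi.items())
    return matched / sum(hyp_bi.values())

ref = "the inspectors will return to baghdad next month , the UN said".split()
# Same three chunks, two different orders; the bigrams that straddle
# chunk boundaries do not match the reference in either order.
hyp1 = "the inspectors will return the UN said to baghdad next month".split()
hyp2 = "to baghdad next month the inspectors will return the UN said".split()
print(bigram_precision(hyp1, ref))  # 0.8
print(bigram_precision(hyp2, ref))  # 0.8, despite the reordering
```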

Page 12: Re-evaluating Bleu

Bleu and Tides 2005

• Bleu scores showed significant divergence from human judgments in the 2005 Tides Evaluation

• It ranked the system considered the best by humans as sixth in performance

Page 13: Re-evaluating Bleu

Bleu and Tides 2005

• Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs

• System A: Iran has already stated that Kharazi’s statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs.

• N-gram matches: 1-gram: 27; 2-gram: 20; 3-gram: 15; 4-gram: 10

• Human scores: Adequacy: 3, 2; Fluency: 3, 2 (from Callison-Burch 2005)

Page 14: Re-evaluating Bleu

Bleu and Tides 2005

• Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs

• System B: Iran already announced that Kharazi will not attend the conference because of statements made by Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs.

• N-gram matches: 1-gram: 24; 2-gram: 19; 3-gram: 15; 4-gram: 12

• Human scores: Adequacy: 5, 4; Fluency: 5, 4 (from Callison-Burch 2005)

Page 15: Re-evaluating Bleu

An Experiment with Bleu

Page 16: Re-evaluating Bleu

Bleu and Tides 2005

• “This opens the possibility that in order for Bleu to be valid, only sufficiently similar systems should be compared with one another”

Page 17: Re-evaluating Bleu

Additional Flaws

• Multiple human reference translations are expensive

• N-grams that show up in multiple reference translations are weighted the same as those that appear in only one

• Content words are weighted the same as common words: ‘The’ counts the same as ‘Parliament’

• Bleu accounts for the diversity of human translations, but not for synonyms

Page 18: Re-evaluating Bleu

An Extension of Bleu

• Described in Babych & Hartley, 2004

• Adds weights to matched items using the tf/idf and S-scores

Page 19: Re-evaluating Bleu

Addressing Flaws

• Can work with only one human reference translation
  – Recall can actually be calculated
  – The paper is not very clear about how this sentence is selected

• Content words are weighted differently than common words
  – ‘The’ does not count the same as ‘Parliament’

Page 20: Re-evaluating Bleu

Calculating the tf/idf Score

• tf.idf(i,j) = (1 + log(tf(i,j))) · log(N / df(i)),  if tf(i,j) ≥ 1

where:
  tf(i,j) is the number of occurrences of the word w(i) in the document d(j);
  df(i) is the number of documents in the corpus where the word w(i) occurs;
  N is the total number of documents in the corpus.

(From Babych 2004)
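
A minimal sketch of this tf.idf weight, assuming whitespace-tokenized documents; the function name and toy corpus are illustrative only.

```python
import math

def tf_idf(word, doc_tokens, corpus_docs):
    """tf.idf(i,j) = (1 + log(tf_ij)) * log(N / df_i), per the formula above.

    doc_tokens:  the tokenized document d_j
    corpus_docs: list of tokenized documents forming the corpus
    Returns 0 when the word does not occur in the document (tf_ij < 1).
    """
    tf = doc_tokens.count(word)
    if tf < 1:
        return 0.0
    N = len(corpus_docs)
    df = sum(1 for d in corpus_docs if word in d)
    return (1 + math.log(tf)) * math.log(N / df)

corpus = [
    "the parliament adopted the resolution".split(),
    "the vote was held on tuesday".split(),
    "members of the parliament debated the budget".split(),
]
print(tf_idf("parliament", corpus[0], corpus))  # content word: positive weight
print(tf_idf("the", corpus[0], corpus))         # common word: log(3/3) = 0
```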

Page 21: Re-evaluating Bleu

Calculating the S-Score

• The S-score was calculated as:

  S(i,j) = log( ((P_doc(i,j) − P_corp-doc(i)) × (N − df(i)) / N) / P_corp(i) )

where:
  P_doc(i,j) is the relative frequency of the word in the text;
  P_corp-doc(i) is the relative frequency of the same word in the rest of the corpus, without this text;
  (N − df(i)) / N is the proportion of texts in the corpus where this word does not occur;
  P_corp(i) is the relative frequency of the word in the whole corpus, including this particular text.
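
A sketch of the S-score following the reconstruction above; the formula is from Babych & Hartley 2004, but the reconstruction, helper names, and toy corpus here are mine, so treat this as illustrative rather than the paper's implementation.

```python
import math

def s_score(word, doc_tokens, corpus_docs):
    """S-score of `word` in one text, following the formula above.

    Assumes doc_tokens is one of the lists in corpus_docs.
    """
    N = len(corpus_docs)
    df = sum(1 for d in corpus_docs if word in d)
    rest = [t for d in corpus_docs if d is not doc_tokens for t in d]
    all_tokens = [t for d in corpus_docs for t in d]

    p_doc = doc_tokens.count(word) / len(doc_tokens)    # P_doc(i,j)
    p_corp_doc = rest.count(word) / len(rest)           # P_corp-doc(i)
    p_corp = all_tokens.count(word) / len(all_tokens)   # P_corp(i)

    numerator = (p_doc - p_corp_doc) * (N - df) / N
    if numerator <= 0:
        return float("-inf")  # log undefined: the word is not salient in this text
    return math.log(numerator / p_corp)

corpus = [
    "the parliament adopted the resolution".split(),
    "the vote was held on tuesday".split(),
    "members of the parliament debated the budget".split(),
]
print(s_score("resolution", corpus[0], corpus))  # salient here: positive score
print(s_score("the", corpus[0], corpus))         # spread across all texts: not salient
```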

Page 22: Re-evaluating Bleu

Integrating the S-Score

• If for a lexical item in a text the S-score > 1, all counts for the N-grams containing this item are increased by the S-score (not just by 1, as in the baseline BLEU approach).

• If the S-score ≤ 1, the usual N-gram count is applied: the number is increased by 1.

From Babych 2004
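
A small sketch of this weighting rule. How n-grams containing several highly weighted items are handled is not spelled out here, so this sketch makes the assumption of taking the maximum S-score within the n-gram; the names are illustrative.

```python
def ngram_weight(ngram, s_scores):
    """Count increment for one matched n-gram under the S-score weighting rule.

    s_scores maps a word to its S-score in this text. An n-gram containing a
    word with S-score > 1 contributes that S-score instead of 1; otherwise it
    contributes 1, as in baseline BLEU. Taking the max is an assumption.
    """
    best = max((s_scores.get(w, 0.0) for w in ngram), default=0.0)
    return best if best > 1.0 else 1.0

def weighted_match_count(matched_ngrams, s_scores):
    """Weighted numerator for the precision/recall calculation."""
    return sum(ngram_weight(g, s_scores) for g in matched_ngrams)

# Toy usage: 'parliament' carries a high S-score, 'the' does not
s = {"parliament": 2.7, "the": 0.3}
matches = [("the", "parliament"), ("of", "the")]
print(weighted_match_count(matches, s))  # 2.7 + 1 = 3.7
```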

Page 23: Re-evaluating Bleu

The Experiment

• Used 100 French-English texts from the DARPA-94 evaluation corpus

• Included two reference translations

• Results from 4 different MT systems

Page 24: Re-evaluating Bleu

The Experiment

• Stage 1: tf/idf & S-scores are calculated on the two reference translations

• Stage 2: N-gram based evaluation using precision and recall of n-grams in the MT output; n-gram matches were adjusted by the n-gram weights (tf/idf or S-score)

• Stage 3: Comparison with human scores (a small correlation sketch follows below)
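
A minimal sketch of the Stage 3 comparison: Pearson's r between per-system metric scores and per-system human judgments, squared (which appears to be what the r(2) rows in the tables below report). The score lists are placeholders, not the DARPA-94 values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder per-system scores (illustrative, not the experiment's data)
metric_scores  = [0.39, 0.36, 0.37, 0.42]   # e.g. weighted f-score per MT system
human_adequacy = [0.68, 0.71, 0.72, 0.79]   # human adequacy per MT system
print(pearson_r(metric_scores, human_adequacy) ** 2)  # squared correlation
```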

Page 25: Re-evaluating Bleu

Results for tf/idf

System [ade] / [flu]         BLEU [1&2]   Prec.(w) 1/2      Recall(w) 1/2     Fscore(w) 1/2
CANDIDE    0.677 / 0.455     0.3561       0.4767 / 0.4709   0.3363 / 0.3324   0.3944 / 0.3897
GLOBALINK  0.710 / 0.381     0.3199       0.4289 / 0.4277   0.3146 / 0.3144   0.3630 / 0.3624
MS         0.718 / 0.382     0.3003       0.4217 / 0.4218   0.3332 / 0.3354   0.3723 / 0.3737
REVERSO    NA / NA           0.3823       0.4760 / 0.4756   0.3643 / 0.3653   0.4127 / 0.4132
SYSTRAN    0.789 / 0.508     0.4002       0.4864 / 0.4813   0.3759 / 0.3734   0.4241 / 0.4206
Corr r(2) with [ade] – MT    0.5918       0.3399 / 0.3602   0.7966 / 0.8306   0.6479 / 0.6935
Corr r(2) with [flu] – MT    0.9807       0.9665 / 0.9721   0.8980 / 0.8505   0.9853 / 0.9699

Page 26: Re-evaluating Bleu

Results for S-Score

System [ade] / [flu]         BLEU [1&2]   Prec.(w) 1/2      Recall(w) 1/2     Fscore(w) 1/2
CANDIDE    0.677 / 0.455     0.3561       0.4570 / 0.4524   0.3281 / 0.3254   0.3820 / 0.3785
GLOBALINK  0.710 / 0.381     0.3199       0.4054 / 0.4036   0.3086 / 0.3086   0.3504 / 0.3497
MS         0.718 / 0.382     0.3003       0.3963 / 0.3969   0.3237 / 0.3259   0.3563 / 0.3579
REVERSO    NA / NA           0.3823       0.4547 / 0.4540   0.3563 / 0.3574   0.3996 / 0.4000
SYSTRAN    0.789 / 0.508     0.4002       0.4633 / 0.4585   0.3666 / 0.3644   0.4094 / 0.4061
Corr r(2) with [ade] – MT    0.5918       0.2945 / 0.2996   0.8046 / 0.8317   0.6184 / 0.6492
Corr r(2) with [flu] – MT    0.9807       0.9525 / 0.9555   0.9093 / 0.8722   0.9942 / 0.9860

Page 27: Re-evaluating Bleu

Results

• The weighted n-gram model beats BLEU on correlation with adequacy

• The f-score metric is more strongly correlated with fluency

• Evaluation with a single reference translation is stable (add stability chart?)

Page 28: Re-evaluating Bleu

Conclusions

• The Bleu model can be too coarse to differentiate between very different MT systems

• Adequacy is harder to predict than fluency

• Adding weights and using recall and f-scores can bring higher correlations with adequacy and fluency scores

Page 29: Re-evaluating Bleu

References

• Chris Callison-Burch, Miles Osborne and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. To appear in EACL-06.

• Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, July 2002, pp. 311-318.

• Bogdan Babych and Anthony Hartley. 2004. Extending BLEU MT Evaluation Method with Frequency Weighting. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, July 2004.

• Dan Melamed, Ryan Green and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), pp. 61-63, Edmonton, Alberta, May 2003. http://citeseer.csail.mit.edu/melamed03precision.html

• Deborah Coughlin. 2003. Correlating Automated and Human Assessments of Machine Translation Quality. In Proceedings of MT Summit IX.

• LDC. 2005. Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Translations. Revision 1.5.

Page 30: Re-evaluating Bleu

Precision and Bleu

• The Brevity Penalty is designed to compensate for overly terse translations

  BP = 1              if c > r
       e^(1 − r/c)    if c ≤ r

  where c = length of the corpus of hypothesis translations
        r = effective reference corpus length*

Page 31: Re-evaluating Bleu

Precision and Bleu

• Thus, the total Bleu score is:

  BLEU = BP * exp( ∑_{n=1}^{N} w_n log p_n )
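
Putting the last two slides together, a minimal sketch that combines precomputed modified precisions p_1..p_N with the brevity penalty, assuming uniform weights w_n = 1/N; this is an illustration, not the official BLEU implementation.

```python
import math

def bleu_score(p_ns, c, r):
    """Combine modified n-gram precisions with the brevity penalty.

    p_ns: modified precisions p_n for n = 1..N (all assumed > 0)
    c:    length of the corpus of hypothesis translations
    r:    effective reference corpus length
    Uses uniform weights w_n = 1/N.
    """
    bp = 1.0 if c > r else math.exp(1 - r / c)
    n_max = len(p_ns)
    log_sum = sum((1.0 / n_max) * math.log(p) for p in p_ns)
    return bp * math.exp(log_sum)

# Toy usage with made-up precisions and corpus lengths
print(bleu_score([0.75, 0.5, 0.4, 0.3], c=95, r=100))
```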

Page 32: Re-evaluating Bleu

Flaws in the Use of Bleu

• Experiments with Bleu, but no manual evaluation (Callison-Burch 2005)