Re-evaluating Bleu
Alison Alvarez
Machine Translation Seminar
February 16, 2006
Overview
• The Weaknesses of Bleu
  - Introduction
  - Precision and Recall
  - Fluency and Adequacy
  - Variations Allowed by Bleu
  - Bleu and Tides 2005
• An Improved Model
  - Overview of the Model
  - Experiment Results
• Conclusions
Introduction
• Bleu has been shown to have high correlations with human judgments
• Bleu has been used by MT researchers for five years, sometimes in place of manual human evaluations
• But does the minimization of the error rate accurately show improvements in translation quality?
Precision and Bleu
• Of my answers, how many are right/wrong?
• Precision = |B ∩ C| / |C|, i.e. A / C
[Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap]
Precision and Bleu
Bleu is a precision based metric
• The modified n-gram precision, $p_n$, sums over every sentence $S$ in the candidate corpus $C$:

$$p_n = \frac{\sum_{S \in C} \sum_{ngram \in S} Count_{matched}(ngram)}{\sum_{S \in C} \sum_{ngram \in S} Count(ngram)}$$
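To make the formula concrete, here is a minimal Python sketch of modified n-gram precision, assuming pre-tokenized sentences; the helper names ngrams() and modified_precision() are illustrative, not taken from the Bleu reference implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited only up
    to its maximum count in any single reference translation."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref_counts = Counter()
    for gram, count in ((g, c) for ref in references
                        for g, c in Counter(ngrams(ref, n)).items()):
        max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0
```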
Recall and Bleu
• Of the potential answers, how many did I retrieve/miss?
• Recall = |B ∩ C| / |B|, i.e. A / B
[Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap]
Recall and Bleu
• Because Bleu uses multiple reference translations at once, recall cannot be calculated
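Against a single reference, however, n-gram recall is straightforward, which the extended model discussed later exploits. A minimal sketch, reusing the ngrams() helper from the precision sketch above:

```python
from collections import Counter

def ngram_recall(hypothesis, reference, n):
    """Fraction of the reference's n-grams (with clipping) that the
    hypothesis retrieves; only meaningful against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, hyp_counts[gram])
                  for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return matched / total if total else 0.0
```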
Fluency and Adequacy to Evaluators
• Fluency: “How do you judge the fluency of this translation?” Judged with no reference translation, against the standard of written English
• Adequacy: “How much of the meaning expressed in the reference is also expressed in the hypothesis translation?”
Variations
• Bleu allows variations in word and phrase order that reduce fluency
• No constraints are placed on the order in which matching n-grams may occur
Variations
[Example figure: two hypothesis translations built from the same matched chunks in different orders; the image did not survive the transcript]
• The above two translations have the same bigram score.
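Because the slide's example is lost, here is a hypothetical stand-in using modified_precision() from the earlier sketch: two different re-orderings of the same matched chunks receive identical bigram precision even though they read very differently.

```python
# Reference and two hypotheses built from the same three chunks.
ref   = "the cat sat quietly while the dog barked at the postman".split()
hyp_a = "while the dog barked the cat sat quietly at the postman".split()
hyp_b = "the cat sat quietly at the postman while the dog barked".split()

print(modified_precision(hyp_a, [ref], 2))  # 0.8
print(modified_precision(hyp_b, [ref], 2))  # 0.8 -- same bigram score
```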
Bleu and Tides 2005
• Bleu scores showed significant divergence from human judgments in the 2005 Tides Evaluation
• It ranked the system judged best by humans sixth in performance
Bleu and Tides 2005
• Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs
• System A: Iran has already stated that Kharazi’s statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs.
• N-gram matches: 1-gram: 27; 2-gram: 20; 3-gram: 15; 4-gram: 10
• Human scores: Adequacy: 3, 2; Fluency: 3, 2 (from Callison-Burch 2005)
Bleu and Tides 2005
• Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs
• System B: Iran already announced that Kharazi will not attend the conference because of statements made by Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs.
• N-gram matches: 1-gram: 24; 2-gram: 19; 3-gram: 15; 4-gram: 12
• Human scores: Adequacy: 5, 4; Fluency: 5, 4 (from Callison-Burch 2005)
An Experiment with Bleu
Bleu and Tides 2005
• “This opens the possibility that in order for Bleu to be valid, only sufficiently similar systems should be compared with one another”
Additional Flaws
• Multiple human reference translations are expensive to produce
• N-grams that appear in multiple reference translations are weighted the same as those that appear in only one
• Content words are weighted the same as common words: ‘the’ counts the same as ‘Parliament’
• Multiple references let Bleu account for the diversity of human translations, but it still cannot credit synonyms
An Extension of Bleu
• Described in Babych & Hartley, 2004
• Adds weights to matched items using a tf.idf-based S-score
Addressing Flaws
• Can work with only one human reference translation, so recall can actually be calculated (the paper is not very clear about how this reference is selected)
• Content words are weighted differently from common words: ‘the’ does not count the same as ‘Parliament’
Calculating the tf/idf Score
• The tf.idf score is calculated as:

$$tf.idf(i,j) = (1 + \log(tf_{i,j})) \cdot \log(N / df_i), \quad \text{if } tf_{i,j} \ge 1$$

where:
- $tf_{i,j}$ is the number of occurrences of the word $w_i$ in the document $d_j$;
- $df_i$ is the number of documents in the corpus where the word $w_i$ occurs;
- $N$ is the total number of documents in the corpus. (From Babych 2004)
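As a concrete illustration, a minimal sketch of this weighting over a toy corpus, assuming each document is a list of tokens; the function name is illustrative.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Per-document tf.idf weights; `documents` is a list of token lists."""
    N = len(documents)
    df = Counter()                       # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    weights = {}
    for j, doc in enumerate(documents):
        tf = Counter(doc)                # term frequency within document j
        weights[j] = {w: (1 + math.log(c)) * math.log(N / df[w])
                      for w, c in tf.items()}
    return weights
```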
Calculating the S-Score
• The S-score was calculated as:

$$S(i,j) = \log \frac{\left(P_{doc}(i,j) - P_{corp-doc}(i)\right) \cdot \frac{N - df(i)}{N}}{P_{corp}(i)}$$

where:
- $P_{doc}(i,j)$ is the relative frequency of the word in the text;
- $P_{corp-doc}(i)$ is the relative frequency of the same word in the rest of the corpus, without this text;
- $(N - df(i)) / N$ is the proportion of texts in the corpus where this word does not occur;
- $P_{corp}(i)$ is the relative frequency of the word in the whole corpus, including this particular text.
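A minimal sketch of this score under stated assumptions: documents are token lists, relative frequencies are raw counts divided by lengths, and the caller only asks about words over-represented in the text (otherwise the log is undefined).

```python
import math

def s_score(word, doc, corpus):
    """S(i,j) for `word` in document `doc`; `corpus` is a list of token lists
    that includes `doc`. Defined only when the word is over-represented in
    `doc`, so the log argument is positive."""
    N = len(corpus)
    df = sum(1 for d in corpus if word in d)
    p_doc = doc.count(word) / len(doc)
    rest = [w for d in corpus if d is not doc for w in d]
    p_corp_doc = rest.count(word) / len(rest)
    whole = [w for d in corpus for w in d]
    p_corp = whole.count(word) / len(whole)
    return math.log((p_doc - p_corp_doc) * (N - df) / N / p_corp)
```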
Integrating the S-Score
• If the S-score of a lexical item in a text is > 1, all counts for the N-grams containing this item are increased by the S-score (not just by 1, as in the baseline BLEU approach)
• If the S-score ≤ 1, the usual N-gram count applies: the count is increased by 1
From Babych 2004
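A sketch of this counting rule; the paper does not spell out how an n-gram containing several salient words is handled, so taking the maximum S-score here is an assumption.

```python
def ngram_increment(gram, s_scores):
    """Count increment for one matched n-gram, given per-word S-scores
    computed on the reference text. Assumption: when several words in the
    n-gram are salient (S-score > 1), the largest S-score is used."""
    best = max(s_scores.get(word, 0.0) for word in gram)
    return best if best > 1.0 else 1.0
```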
The Experiment
• Used 100 French-English texts from the DARPA-94 evaluation corpus
• Included two reference translations
• Results from 4 different MT systems
The Experiment
• Stage 1: tf.idf and S-scores are calculated on the two reference translations
• Stage 2: N-gram based evaluation using precision and recall of n-grams in the MT output; n-gram matches are adjusted by the tf.idf weights or S-scores (see the sketch after this list)
• Stage 3: Comparison with human scores
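A sketch of how stage 2 might plug the weights into precision and recall, shown for unigrams against a single non-empty reference; this illustrates the idea rather than reproducing the authors' exact implementation.

```python
from collections import Counter

def weighted_prf(hypothesis, reference, weights):
    """Weighted unigram precision, recall and f-score against one reference;
    `weights` maps word -> weight, with 1.0 assumed for unlisted words."""
    hyp, ref = Counter(hypothesis), Counter(reference)
    w = lambda t: max(weights.get(t, 1.0), 1.0)   # mirrors the S-score rule
    matched = sum(min(c, ref[t]) * w(t) for t, c in hyp.items())
    prec = matched / sum(c * w(t) for t, c in hyp.items())
    rec = matched / sum(c * w(t) for t, c in ref.items())
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```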
Results for tf/idf
System [ade]/[flu]         BLEU [1&2]  Prec.(w) 1/2     Recall(w) 1/2    Fscore(w) 1/2
CANDIDE 0.677/0.455        0.3561      0.4767 / 0.4709  0.3363 / 0.3324  0.3944 / 0.3897
GLOBALINK 0.710/0.381      0.3199      0.4289 / 0.4277  0.3146 / 0.3144  0.3630 / 0.3624
MS 0.718/0.382             0.3003      0.4217 / 0.4218  0.3332 / 0.3354  0.3723 / 0.3737
REVERSO NA/NA              0.3823      0.4760 / 0.4756  0.3643 / 0.3653  0.4127 / 0.4132
SYSTRAN 0.789/0.508        0.4002      0.4864 / 0.4813  0.3759 / 0.3734  0.4241 / 0.4206
Corr r² with [ade] – MT    0.5918      0.3399 / 0.3602  0.7966 / 0.8306  0.6479 / 0.6935
Corr r² with [flu] – MT    0.9807      0.9665 / 0.9721  0.8980 / 0.8505  0.9853 / 0.9699
Results for S-Score
System [ade]/[flu]         BLEU [1&2]  Prec.(w) 1/2     Recall(w) 1/2    Fscore(w) 1/2
CANDIDE 0.677/0.455        0.3561      0.4570 / 0.4524  0.3281 / 0.3254  0.3820 / 0.3785
GLOBALINK 0.710/0.381      0.3199      0.4054 / 0.4036  0.3086 / 0.3086  0.3504 / 0.3497
MS 0.718/0.382             0.3003      0.3963 / 0.3969  0.3237 / 0.3259  0.3563 / 0.3579
REVERSO NA/NA              0.3823      0.4547 / 0.4540  0.3563 / 0.3574  0.3996 / 0.4000
SYSTRAN 0.789/0.508        0.4002      0.4633 / 0.4585  0.3666 / 0.3644  0.4094 / 0.4061
Corr r² with [ade] – MT    0.5918      0.2945 / 0.2996  0.8046 / 0.8317  0.6184 / 0.6492
Corr r² with [flu] – MT    0.9807      0.9525 / 0.9555  0.9093 / 0.8722  0.9942 / 0.9860
Results
• The weighted n-gram model beats BLEU in correlation with adequacy
• The f-score metric is the most strongly correlated with fluency
• Scores based on a single reference translation are stable
Conclusions
• The Bleu model can be too coarse to differentiate between very different MT systems
• Adequacy is harder to predict than fluency
• Adding weights and using recall and f-scores can bring higher correlations with adequacy and fluency scores
References
• Chris Callison-Burch, Miles Osborne and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research, to appear in EACL-06.
• Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02). Philadelphia, PA. July 2002. pp. 311-318.
• Babych, B. and Hartley, A. 2004. Extending BLEU MT Evaluation Method with Frequency Weighting. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). Barcelona, Spain. July 2004.
• Dan Melamed, Ryan Green and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Human Language Technology Conference (HLT-NAACL). Edmonton, Alberta. May 2003. pp. 61-63. http://citeseer.csail.mit.edu/melamed03precision.html
• Deborah Coughlin. 2003. Correlating Automated and Human Assessments of Machine Translation Quality. In Proceedings of MT Summit IX.
• LDC. 2005. Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Translations. Revision 1.5.
Precision and Bleu
• The Brevity Penalty is designed to compensate for overly terse translations:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where c is the length of the corpus of hypothesis translations and r is the effective reference corpus length
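A minimal sketch of the penalty as defined above:

```python
import math

def brevity_penalty(c, r):
    """c: length of the hypothesis corpus; r: effective reference length."""
    return 1.0 if c > r else math.exp(1 - r / c)
```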
Precision and Bleu
• Thus, the total Bleu score is:

$$BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$
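A minimal sketch combining the pieces, with the conventional uniform weights w_n = 1/N and assuming every p_n is nonzero:

```python
import math

def bleu_score(precisions, bp):
    """Combine modified n-gram precisions p_1..p_N (all assumed > 0) with
    the brevity penalty, using uniform weights w_n = 1/N."""
    N = len(precisions)
    return bp * math.exp(sum(math.log(p) / N for p in precisions))
```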
Flaws in the Use of Bleu
• Experiments with Bleu, but no manual evaluation (Callison-Burch 2005)