a tale about pro and monsters preslav nakov, francisco guzmán and stephan vogel acl, sofia august 5...

A Tale about PRO and Monsters

Preslav Nakov, Francisco Guzmán and Stephan Vogel

ACL, Sofia

August 5, 2013


Parameter Optimization

MERT, PRO, MIRA, kb, rampion


Some Parameter Optimizers for SMT

Optimizer                                        Scales to many parameters?   Fits the typical SMT architecture?
MERT (Och, 2003)                                 NO                           YES: batch
MIRA (Watanabe et al 2007; Chiang et al 2008)    YES                          NO: online
PRO (Hopkins & May 2011)                         YES                          YES: batch

Simple but effective. Increased stability. Really?


PRO in a Nutshell

• A ranking problem over two translations j and j’, each ranked
- according to the model (model score)
- according to the evaluation score (BLEU+1)

[Figure: two panels plotting j and j’ by BLEU+1 Score and Model Score; label: New weights]
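One way to make the ranking view concrete, following the pairwise formulation of Hopkins & May (2011): each pair (j, j’) becomes a binary classification example on the difference of the two feature vectors. This is only a minimal sketch, not the authors' code; the function name `pair_to_examples` and the list-based feature vectors are hypothetical choices made for illustration.

```python
def pair_to_examples(feats_j, feats_jp, bleu_j, bleu_jp):
    """Turn one translation pair into two signed difference examples.

    feats_j, feats_jp: feature vectors of translations j and j' (hypothetical layout).
    bleu_j, bleu_jp:   their sentence-level BLEU+1 scores.
    Ties are assumed to have been filtered out before this call.
    """
    diff = [a - b for a, b in zip(feats_j, feats_jp)]
    # Label is +1 when j is the better translation, -1 otherwise.
    label = 1 if bleu_j > bleu_jp else -1
    mirrored = [-d for d in diff]
    # The mirrored example keeps the training set balanced.
    return [(diff, label), (mirrored, -label)]
```

A linear classifier trained on such examples yields a weight vector that tends to score the better translation of each pair higher, which is the sense in which new weights are learned.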


The Original PRO Algorithm

PRO’s steps (1-3 for each sentence separately; 4: combine all)

1. Sampling: randomly sample 5000 pairs (j, j’) from an n-best list

2. Selection: choose those whose BLEU+1 diff > 5 BLEU

3. Acceptance: accept (at most) the top 50 sentence pairs (with max differences)

4. Learning: use the pairs for all sentences to train a ranker

Requires good training examples
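Steps 1-3 can be sketched for a single sentence as follows. This is an illustrative sketch under stated assumptions, not the Moses implementation: the function name `pro_select_pairs`, its parameters, and the representation of the n-best list as a plain list of BLEU+1 scores (in BLEU points, 0-100) are all hypothetical.

```python
import random

def pro_select_pairs(bleu1, n_samples=5000, min_diff=5.0, max_pairs=50, rng=None):
    """Steps 1-3 for ONE sentence.

    bleu1: BLEU+1 scores (BLEU points, 0-100) of the n-best translations.
    Returns up to max_pairs tuples (j, j_prime, diff), largest diff first.
    The same pair may be drawn more than once; this sketch does not deduplicate.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    n = len(bleu1)
    if n < 2:
        return []
    candidates = []
    for _ in range(n_samples):
        # Step 1 (sampling): draw a pair of n-best indices uniformly at random.
        j, j_prime = rng.randrange(n), rng.randrange(n)
        diff = abs(bleu1[j] - bleu1[j_prime])
        # Step 2 (selection): keep the pair only if the BLEU+1 gap is large enough.
        if diff > min_diff:
            candidates.append((j, j_prime, diff))
    # Step 3 (acceptance): keep the max_pairs pairs with the largest gaps.
    candidates.sort(key=lambda pair: pair[2], reverse=True)
    return candidates[:max_pairs]
```

Step 4 would then pool the accepted pairs from all sentences and train a ranker on them.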

A Cautionary Tale


Tuning on Long Sentences …

NIST: Arabic-English, tune on longest 50% of MT06

[Figure: Tuning BLEU and length ratio]

MERT works just fine.


…There is Evidence that…

NIST: Arabic-English, tune on longest 50% of MT06

[Figure: Tuning BLEU and length ratio; annotations: MONSTERS, 5x !!!]

PRO is unstable.

Monsters also happen on IWSLT and Spanish-English.


…Monsters Exist…

• What?
- Bad negative examples: low BLEU, too long
- Very divergent from positive examples
- Not useful for learning

• When?
- Tuning on longer sentences
- Several language pairs

[Figure: scatter plot with axes x1 and x2; labels: Pos, Neg, MONSTERS]


… and Breed…

• n-best accumulation ensures monster prevalence across iterations


… to Ruin your Translations…

REF: but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness .

IT1: but we are that we add our ranks to some of us and that we know that in the strength and weakness in

IT3: , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and

IT4: namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never blame your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3.8 billion in tourism income proceeds Chamber are divided among themselves : some 've dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure

Image:samii69.deviantart.com


…and Only PRO Fears Them…

NIST: Ar-En, test on MT09, tune on longest 50% of MT06

-3BP

Optimizing for Sentence-Level BLEU+1 Yields Short Translations (Nakov et al., COLING 2012)

*MIRA = batch-MIRA (Cherry & Foster, 2012)


...but Why?

PRO’s steps

1. Sampling: randomly sample 5000 pairs

2. Selection: choose those whose BLEU+1 diff > 5 BLEU
- Focuses on large differentials

3. Acceptance: accept the top 50 sentence pairs (with max differences)
- Selects the TOP differentials

4. Learning: use the pairs for all sentences to train a ranker

Fixes:

1: Change selection

2: Accept at random


On Slaying Monsters

Selection

1. Cut-offs

2. Filter outliers

3. Stochastic sampling

Acceptance

4. Random sampling
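Random acceptance replaces "take the top 50" with a uniform draw from the candidate pairs. A minimal sketch, assuming candidates are stored as a list of tuples; the helper name `accept_random` and its parameters are hypothetical:

```python
import random

def accept_random(candidates, max_pairs=50, rng=None):
    """Accept up to max_pairs candidate pairs uniformly at random, without replacement."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    k = min(max_pairs, len(candidates))
    return rng.sample(candidates, k)
```

Because every candidate is equally likely to be accepted, pairs with extreme BLEU+1 gaps no longer dominate the training set by construction.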

Image:redbubble.com


Selection Methods: Cutoffs

• BLEU diff
- BLEU diff > 5 (default)
- BLEU diff < 10
- BLEU diff < 20

• Length diff
- length diff < 10 words
- length diff < 20 words
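The cut-offs above amount to a simple predicate on each sampled pair. The sketch below is illustrative only; the function name `passes_cutoffs`, its parameter names, and the use of BLEU points (0-100) and word counts as units are assumptions.

```python
def passes_cutoffs(bleu_j, bleu_jp, len_j, len_jp,
                   min_bleu_diff=5.0, max_bleu_diff=None, max_len_diff=None):
    """Return True if the pair (j, j') survives the selection cut-offs.

    min_bleu_diff: default lower bound (BLEU diff > 5).
    max_bleu_diff: optional upper bound (e.g. 10 or 20), None to disable.
    max_len_diff:  optional bound on length difference in words, None to disable.
    """
    bleu_diff = abs(bleu_j - bleu_jp)
    len_diff = abs(len_j - len_jp)
    if bleu_diff <= min_bleu_diff:
        return False  # gap too small to be informative
    if max_bleu_diff is not None and bleu_diff >= max_bleu_diff:
        return False  # gap suspiciously large
    if max_len_diff is not None and len_diff >= max_len_diff:
        return False  # lengths too different
    return True
```

For example, `passes_cutoffs(30.0, 20.0, 25, 28, max_bleu_diff=20.0)` keeps the pair, while a pair with a 40-point BLEU+1 gap is rejected under the same setting.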


Selection Methods: Outliers

• Assume Gaussian

• Filter outliers that are more than λ times stdev away
- λ = 2
- λ = 3

[Figure: distribution with outliers marked beyond λσ]
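Outlier filtering can be sketched as a mean and standard-deviation test. The slide does not spell out which quantity is tested, so the sketch takes any list of per-pair values (for instance BLEU+1 or length differences); this choice, the function name `filter_outliers`, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def filter_outliers(values, lam=2.0):
    """Return the indices of values within lam standard deviations of the mean."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return list(range(len(values)))  # all values identical: nothing to filter
    return [i for i, v in enumerate(values) if abs(v - mu) <= lam * sigma]
```

Note that a few very large values inflate both the mean and the standard deviation, so this kind of test works best when extreme values are rare.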


Selection Methods: Stochastic sampling

1. Generate empirical distribution for (j,j’)

2. Sample according to it

Select if p_rand <= p(j,j’)
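The two steps can be sketched as follows. The slide does not say how the empirical distribution is built, so this sketch assumes a fixed-width histogram over per-pair differences; the binning, the function name `stochastic_select`, and the `bin_width` parameter are all hypothetical.

```python
import random
from collections import Counter

def stochastic_select(diffs, bin_width=1.0, rng=None):
    """Select pair indices with probability equal to their bin's relative frequency.

    diffs: one non-negative value per sampled pair (e.g. a BLEU+1 difference).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    # Step 1: empirical distribution as a histogram over the observed values.
    bins = [int(d // bin_width) for d in diffs]
    counts = Counter(bins)
    total = len(diffs)
    selected = []
    for i, b in enumerate(bins):
        p = counts[b] / total
        # Step 2: select the pair if p_rand <= p(j, j').
        if rng.random() <= p:
            selected.append(i)
    return selected
```

Pairs whose differences fall in sparsely populated bins are selected rarely, which only helps if extreme pairs are in fact uncommon.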


Experimental Setup

• NIST Ar-En

• TM: NIST 2012 data (no UN)

• LM: 5-gram English Gigaword v.5

• Tuning: 50% longest MT06
- contrast: full MT06

• Test: MT09

3 reruns for each experiment!


Altering Selection (Tuning on Longest 50% of MT06)

NOTE: We still require at least 5 BLEU+1 points of difference.

[Results, annotated: Kill monsters]


Altering Selection: Testing on Full MT09

NOTE: We still require at least 5 BLEU+1 points of difference.

• Tuning on longest 50%: better BLEU, increased stability
• Tuning on all: same BLEU, same or better stability

[Results figure; annotations: Kill monsters, Outperforms others; values shown: 47.72, 47.48]


Random Accept (Tuning on Longest 50% of MT06)

NOTE: No minimum BLEU+1 points of difference.

Random accept kills monsters.


Random Accept: Testing on Full MT09

NOTE: No minimum BLEU+1 points of difference.

• Tuning on longest 50%: better BLEU, increased stability; outperforms others
• Tuning on all: worse BLEU, more unstable

[Results figure; values shown: 47.72, 47.48]


Summary

• Sample-based methods
- Do not kill monsters
- Distributional assumptions
- Assume monsters are rare

• Random acceptance
- Kills monsters
- Decreases discriminative power
- Lowers test scores on tune:full

• Simple cut-offs
- Protect against monsters
- Do not affect the performance on tune:full
- Recommended!


Moral of the Tale

• Monsters: examples unsuitable for learning

• PRO’s policies to blame:
- Selection
- Acceptance

• Slaying monsters with cut-offs also gives:
- more stability
- better BLEU

• If you use PRO you should care!

Would you risk it?

Coming to Moses 1.0 soon!


Thank you! Questions?
