A Tale about PRO and Monsters
Preslav Nakov, Francisco Guzmán and Stephan Vogel
ACL, Sofia
August 5, 2013
2
Parameter Optimization
MERT, PRO, MIRA, kb, rampion
3
Some Parameter Optimizers for SMT
Optimizer                                           Scales to many parameters?   Fits the typical SMT architecture?
MERT (Och, 2003)                                    NO                           YES: batch
MIRA (Watanabe et al., 2007; Chiang et al., 2008)   YES                          NO: online
PRO (Hopkins & May, 2011)                           YES                          YES: batch
Simple but effective. Increased stability. Really?
4
PRO in a Nutshell
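•A ranking problem
- Take two translations j and j’ of the same sentence
- Rank them according to the model (model score) and according to the evaluation score (BLEU+1)
- New weights are learned so that the two rankings agree
[Figure: translations j and j’ plotted by BLEU+1 score against model score, before and after the new weights]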
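The ranking view above can be sketched in a few lines. This is a minimal illustration, assuming a linear model (score = w · x) and the pairwise-difference encoding commonly used for ranking-as-classification; the names `ranks_agree` and `pair_to_examples` are hypothetical and not taken from any toolkit.

```python
from typing import List, Tuple

Vector = List[float]


def dot(w: Vector, x: Vector) -> float:
    """Linear model score of a translation with feature vector x."""
    return sum(a * b for a, b in zip(w, x))


def ranks_agree(w: Vector, x_j: Vector, x_jp: Vector,
                bleu_j: float, bleu_jp: float) -> bool:
    """True if the model orders j and j' the same way BLEU+1 does."""
    model_diff = dot(w, x_j) - dot(w, x_jp)
    metric_diff = bleu_j - bleu_jp
    return model_diff * metric_diff > 0


def pair_to_examples(x_j: Vector, x_jp: Vector,
                     bleu_j: float, bleu_jp: float) -> List[Tuple[Vector, int]]:
    """Turn one pair into two binary examples on feature differences.

    Assumes bleu_j != bleu_jp (tied pairs would be dropped earlier).
    """
    diff = [a - b for a, b in zip(x_j, x_jp)]
    label = 1 if bleu_j > bleu_jp else -1
    # The mirrored example keeps the training set balanced.
    return [(diff, label), ([-d for d in diff], -label)]
```

A binary classifier trained on such examples yields a weight vector that can be used directly as the new model weights.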
5
The Original PRO Algorithm
PRO’s steps (1-3 for each sentence separately; 4 combines all)
1. Sampling: randomly sample 5000 pairs (j, j’) from an n-best list
2. Selection: choose those whose BLEU+1 diff > 5 BLEU
3. Acceptance: accept (at most) the top 50 sentence pairs (with max differences)
4. Learning: use the pairs for all sentences to train a ranker (requires good training examples)
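Steps 1-3 for a single sentence can be sketched as follows. This is an illustration under stated assumptions, not the Moses implementation: BLEU+1 scores are on a 0-100 scale, the n-best list is reduced to a list of scores, and the function name `pro_select_pairs` is hypothetical.

```python
import random
from typing import List, Optional, Tuple


def pro_select_pairs(bleu: List[float],
                     n_samples: int = 5000,
                     min_diff: float = 5.0,
                     max_pairs: int = 50,
                     rng: Optional[random.Random] = None) -> List[Tuple[int, int]]:
    """Return up to max_pairs index pairs (j, j') for one sentence."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    n = len(bleu)
    if n < 2:
        return []
    candidates = []
    for _ in range(n_samples):
        # Step 1: sampling (with replacement, so repeats are possible)
        j, jp = rng.randrange(n), rng.randrange(n)
        diff = abs(bleu[j] - bleu[jp])
        # Step 2: selection, keep only sufficiently different pairs
        if diff > min_diff:
            candidates.append((diff, j, jp))
    # Step 3: acceptance, keep the pairs with the largest differences
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(j, jp) for _, j, jp in candidates[:max_pairs]]
```

Because acceptance sorts by difference, a single very low-scoring translation in the n-best list tends to appear in most accepted pairs.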
A Cautionary Tale
7
Tuning on Long Sentences …
NIST: Arabic-English, tune on longest 50% of MT06
MERT works just fine.
[Plot: tuning BLEU and length ratio]
8
…There is Evidence that…
NIST: Arabic-English, tune on longest 50% of MT06
PRO is unstable.
[Plot: tuning BLEU and length ratio, annotated "5x !!!" and "MONSTERS"]
Monsters also happen on IWSLT and Spanish-English.
9
…Monsters Exist…
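•What?
- Bad negative examples
- Low BLEU
- Too long
- Very divergent from positive examples
- Not useful for learning
•When?
- Tuning on longer sentences
- Several language pairs
[Figure: positive (Pos) and negative (Neg) examples in feature space (x1, x2), with the MONSTERS far from the rest]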
10
… and Breed…
•n-best accumulation ensures monster prevalence across iterations
11
… to Ruin your Translations…
REF: but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness .
IT1: but we are that we add our ranks to some of us and that we know that in the strength and weakness in
IT3: , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and
IT4: namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never blame your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3.8 billion in tourism income proceeds Chamber are divided among themselves : some 've dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure
12
…and Only PRO Fears Them…
NIST: Ar-En, test on MT09, tune on longest 50% of MT06
-3BP
Optimizing for Sentence-Level BLEU+1 Yields Short Translations (Nakov et al., COLING 2012)
*MIRA = batch-MIRA (Cherry & Foster, 2012)
13
...but Why?
PRO’s steps
1. Sampling: randomly sample 5000 pairs
2. Selection: choose those whose BLEU+1 diff > 5 BLEU (focuses on large differentials)
3. Acceptance: accept the top 50 sentence pairs with max differences (selects the TOP differentials)
4. Learning: use the pairs for all sentences to train a ranker
Possible fixes
1: Change selection
2: Accept at random
14
On Slaying Monsters
Selection
1. Cut-offs
2. Filter outliers
3. Stochastic sampling
Acceptance
4. Random sampling
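The acceptance-side fix (option 4) can be sketched as a drop-in replacement for "take the top 50". This is a minimal illustration; the function name `accept_random` and the pair representation are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Pair = Tuple[int, int]  # indices (j, j') into an n-best list


def accept_random(candidates: List[Pair],
                  max_pairs: int = 50,
                  rng: Optional[random.Random] = None) -> List[Pair]:
    """Accept up to max_pairs candidate pairs uniformly at random,
    instead of the pairs with the largest BLEU+1 differences."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    if len(candidates) <= max_pairs:
        return list(candidates)
    return rng.sample(candidates, max_pairs)
```

With uniform acceptance, an extreme pair is no more likely to be kept than any other candidate.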
15
Selection Methods: Cutoffs
• BLEU diff
- BLEU diff > 5 (default)
- BLEU diff < 10
- BLEU diff < 20
• Length diff
- length diff < 10 words
- length diff < 20 words
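The cut-offs above amount to a simple predicate on each sampled pair. The sketch below assumes BLEU+1 on a 0-100 scale and lengths in words; the function name and parameter names are hypothetical, and the default upper bound of 20 is just one of the values listed.

```python
from typing import Optional


def passes_cutoffs(bleu_j: float, bleu_jp: float,
                   len_j: int, len_jp: int,
                   min_bleu_diff: float = 5.0,
                   max_bleu_diff: Optional[float] = 20.0,
                   max_len_diff: Optional[int] = None) -> bool:
    """Keep a pair only if its differences fall inside the cut-offs."""
    bleu_diff = abs(bleu_j - bleu_jp)
    # Default PRO requirement: the pair must differ by more than 5 BLEU+1.
    if bleu_diff <= min_bleu_diff:
        return False
    # Upper BLEU cut-off: drop pairs that are too far apart.
    if max_bleu_diff is not None and bleu_diff >= max_bleu_diff:
        return False
    # Length cut-off: drop pairs whose lengths differ too much.
    if max_len_diff is not None and abs(len_j - len_jp) >= max_len_diff:
        return False
    return True
```

For example, `passes_cutoffs(30.0, 42.0, 20, 22)` keeps an ordinary pair, while a pair involving a very long, very low-scoring translation is rejected by either upper bound.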
16
Selection Methods: Outliers
•Assume Gaussian
•Filter outliers that are more than λ times stdev away
- λ = 2
- λ = 3
[Figure: Gaussian curve with outliers lying more than λσ from the mean]
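A minimal sketch of the outlier filter, assuming it is applied to one per-pair statistic (for instance the BLEU+1 difference) and that mean and standard deviation are estimated from the sampled pairs themselves. The function name `filter_outliers` is hypothetical.

```python
import statistics
from typing import List


def filter_outliers(diffs: List[float], lam: float = 2.0) -> List[float]:
    """Drop values more than lam standard deviations from the mean."""
    if len(diffs) < 2:
        return list(diffs)
    mu = statistics.mean(diffs)
    sigma = statistics.pstdev(diffs)  # population stdev of the sample
    if sigma == 0:
        return list(diffs)
    return [d for d in diffs if abs(d - mu) <= lam * sigma]
```

Note that extreme values inflate the estimated standard deviation, so the filter is only effective when such values are rare in the sample.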
17
Selection Methods: Stochastic sampling
1. Generate empirical distribution for (j,j’)
2. Sample according to it
Select if p_rand <= p(j,j’)
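One way to read the two steps above is sketched below. The slide does not say how the empirical distribution is built, so the histogram over BLEU+1 differences and the bin width are assumptions made for illustration, and the function name `stochastic_select` is hypothetical.

```python
import random
from collections import Counter
from typing import List, Optional


def stochastic_select(diffs: List[float],
                      bin_width: float = 5.0,
                      rng: Optional[random.Random] = None) -> List[int]:
    """Return indices of pairs kept by sampling from an empirical histogram."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    # Step 1: empirical distribution, here the relative frequency of each bin.
    bins = [int(d // bin_width) for d in diffs]
    counts = Counter(bins)
    total = len(diffs)
    selected = []
    for i, b in enumerate(bins):
        p = counts[b] / total
        # Step 2: select the pair if p_rand <= p(j, j').
        if rng.random() <= p:
            selected.append(i)
    return selected
```

Pairs whose difference falls in a sparsely populated bin are selected with low probability, which only helps if extreme pairs are in fact rare.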
18
Experimental Setup
•NIST Ar-En
•TM: NIST 2012 data (no UN)
•LM: 5-gram English Gigaword v.5
•Tuning: 50% longest MT06
- contrast: full MT06
•Test: MT09
•3 reruns for each experiment!
19
Altering Selection (Tuning on Longest 50% of MT06)
NOTE: We still require at least 5 BLEU+1 points of difference.
Kill monsters
20
Altering Selection: Testing on Full MT09
NOTE: We still require at least 5 BLEU+1 points of difference.
Tuning on longest 50%: better BLEU, increased stability
Tuning on all: same BLEU, same or better stability
Kill monsters
Outperforms others
47.72 47.48
21
Random Accept (Tuning on Longest 50% of MT06)
NOTE: No minimum BLEU+1 points of difference.
Random accept kills monsters.
Random Accept: Testing on Full MT09
NOTE: No minimum BLEU+1 points of difference.
Tuning on longest 50%: better BLEU, increased stability
Tuning on all: worse BLEU, more unstable
Outperforms others
47.72 47.48
23
Summary
•Sample-based methods
- Do not kill monsters
- Distributional assumptions
- Assume monsters are rare
•Random acceptance
- Kills monsters
- Decreases discriminative power
- Lowers test scores on tune:full
•Simple cut-offs
- Protect against monsters
- Do not affect the performance on tune:full
- Recommended!
24
Moral of the Tale
•Monsters: examples unsuitable for learning
•PRO’s policies to blame:
- Selection
- Acceptance
•Slaying monsters with cut-offs also gives:
- more stability
- better BLEU
•If you use PRO you should care!
Would you risk it?
Coming to Moses 1.0 soon!
25
Thank you!
Questions?