Andy Way, IGK Summer School, Edinburgh, Sept. 2006
Hybrid Data-Driven Models of Machine Translation
Andy Way (& Declan Groves)
National Centre for Language Technology, School of Computing,
Dublin City University, Dublin 9, Ireland
Outline
• Motivations
• Example-Based Machine Translation
  – Marker-Based EBMT
• Statistical Machine Translation
• Experiments:
  – Language Pairs & Corpora Used
  – EBMT and PBSMT baseline systems
  – Hybrid System Experiments
    • Making use of merged data sets
• ‘Phrases’, ‘Chunks’ and Training-Test Corpora
• Conclusions
• Future Work
Motivations
• Most MT research carried out today is corpus-based:
  – Example-Based Machine Translation (EBMT)
  – Statistical Machine Translation (SMT)
• Lack of comparative research:
  – Relative unavailability of EBMT systems
  – Lack of participation of EBMT researchers in competitive evaluations
  – Dominance of the SMT approach
Example-Based Machine Translation
• As with SMT, EBMT makes use of information extracted from sententially-aligned bilingual corpora. In general:
  – SMT uses only the induced parameters and throws away the data
  – EBMT makes use of the linguistic units directly
• During translation:
  1. Source side of bitext is searched for close matches
  2. Source-target subsentential links are determined
  3. Relevant target fragments are retrieved and recombined to derive the final translation
EBMT: An Example
• Assumes an aligned bilingual corpus of examples against which input text is matched
• Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)

Given the Corpus:
  The shop is open on Monday : Le magasin est ouvert lundi
  John went to the swimming pool : Jean est allé à la piscine
  The butcher’s is next to the baker’s : La boucherie est à côté de la boulangerie
EBMT: An Example
• Identify useful fragments
• Recombination depends on nature of examples used

Given the Corpus:
  The shop is open on Monday : Le magasin est ouvert lundi
  John went to the swimming pool : Jean est allé à la piscine
  The butcher’s is next to the baker’s : La boucherie est à côté de la boulangerie

Isolating useful fragments, we can now translate:
  on Monday : lundi
  John went to : Jean est allé à
  the baker’s : la boulangerie
Marker-Based EBMT at DCU
• Gaijin: [Veale & Way], RANLP ‘97
• [Gough et al.], AMTA ‘02
• wEBMT: [Way & Gough], Computational Linguistics ‘03
• [Gough & Way], EAMT ‘04
• [Way & Gough], TMI ‘04
• [Gough], PhD Thesis ‘05
• [Way & Gough], Natural Language Engineering ‘05
• [Way & Gough], Machine Translation ‘05
• [Groves & Way], ACL workshop on Data-Driven MT ‘05
• [Groves & Way], Machine Translation & EAMT ‘06
• MaTrEx: [Armstrong et al.], TC-STAR OpenLab ‘06
• [Stroppa et al.], NIST MT-Eval ‘06, AMTA ’06, IWSLT-06
System Development

System            Lang. Pair   #Sent. Pairs
Gaijin ‘97        EN-DE        1,836
wEBMT ‘03         FR-EN        219K (Penn-II NPs, VPs)
TMI-04            FR-EN        203,000
ACL-05            FR-EN        322,000
MaTrEx OpenLab    ES-EN        958,000
MaTrEx NIST-06    AR-EN        3,000,000
MaTrEx AMTA-06    Basque-EN    276,000
MaTrEx IWSLT-06   IT-EN        40,000
Marker-Based EBMT

“The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” [Green, 1979]

• Universal psycholinguistic constraint: languages are marked for syntactic structure at surface level by a closed set of lexemes or morphemes

  The Dearborn, Mich., energy company stopped paying a dividend in the third quarter of 1984 because of troubles at its Midland nuclear plant.
• Three NPs start with determiners, one with a possessive pronoun
• Nominal element will appear soon to the right
• Sets of determiners and possessive pronouns small and finite
• Four prepositional phrases, with prepositional heads
• NP object will appear soon to the right
• Set of prepositions small and finite
Marker-Based EBMT: Chunking
• Use a set of closed-class marker words to segment aligned source and target sentences during a pre-processing stage
• English marker words extracted from CELEX
• <PUNC> now used as end-of-chunk marker

  Determiners           <DET>
  Quantifiers           <QUANT>
  Prepositions          <PREP>
  Conjunctions          <CONJ>
  Wh-Adverbs            <WRB>
  Possessive Pronouns   <POSS>
  Personal Pronouns     <PRON>
  Punctuation Marks     <PUNC>
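The chunking step above can be sketched in a few lines of Python. The marker lists here are tiny illustrative subsets (the real system draws the full English sets from CELEX), and `marker_chunk` is a hypothetical name, not the system's actual code:

```python
# Illustrative subsets of the closed-class marker words; the real system
# extracts the full English sets from CELEX.
MARKERS = {
    "the": "<DET>", "a": "<DET>",
    "to": "<PREP>", "of": "<PREP>", "on": "<PREP>", "in": "<PREP>",
    "you": "<PRON>", "we": "<PRON>",
    "and": "<CONJ>", "or": "<CONJ>",
}

def marker_chunk(sentence):
    """Segment a tokenized sentence at marker words, tagging each chunk with
    the category of the marker that opens it. A chunk is only closed once it
    holds at least one non-marker word, so marker-only prefixes
    (e.g. 'of the') attach to the following content word."""
    chunks, current, has_content = [], [], False
    for word in sentence.split():
        tag = MARKERS.get(word.lower())
        if tag and has_content:
            chunks.append(" ".join(current))           # close the finished chunk
            current, has_content = [tag, word], False  # open a new tagged chunk
        elif tag and not current:
            current = [tag, word]                      # first chunk starts here
        else:
            if tag is None:
                has_content = True
            current.append(word)
    if current and has_content:
        chunks.append(" ".join(current))
    return chunks

print(marker_chunk("you click apply to view the effect of the selection"))
# → ['<PRON> you click apply', '<PREP> to view', '<DET> the effect',
#    '<PREP> of the selection']
```

This reproduces the segmentation shown on the following slide for the Sun TM example sentence.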
Marker-Based EBMT: Chunking (2)
• Enables the use of basic syntactic markup for extraction of translation resources
• Source-target sentence pairs are tagged with marker categories in a pre-processing stage:

  EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection
  FR: <PRON> vous cliquez <PRON> sur appliquer <PREP> pour visualiser <DET> l’ effet <PREP> de <DET> la sélection

• Aligned source-target chunks created by segmenting sentences based on these marker tags, along with cognate and word co-occurrence information:

  <PRON> you click apply : <PRON> vous cliquez sur appliquer
  <PREP> to view : <PREP> pour visualiser
  <DET> the effect : <DET> l’ effet
  <PREP> of the selection : <PREP> de la sélection

• Chunks must contain at least one non-marker word; this ensures chunks contain useful contextual information
Marker-Based EBMT: Lexicon & Template Extraction
• Chunks containing only one non-marker word in both source and target languages can then be used to extract a word-level lexicon:

  <PREP> to : <PREP> pour
  <LEX> view : <LEX> visualiser
  <LEX> effect : <LEX> effet
  <DET> the : <DET> l’
  <PREP> of : <PREP> de

• In a final pre-processing stage, we produce a set of generalized marker templates by replacing marker words with their tags:

  <PRON> click apply : <PRON> cliquez sur appliquer
  <PREP> view : <PREP> visualiser
  <DET> effect : <DET> effet
  <PREP> the selection : <PREP> la sélection

• Any marker word pair can now be inserted at the appropriate tag location
• More general examples add flexibility to the matching process and improve coverage (and quality)
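Both extraction steps can be sketched over already-aligned chunk pairs. The marker sets and function names below are illustrative assumptions, not the system's actual implementation:

```python
# Tiny illustrative subsets of the closed-class marker word lists.
EN_MARKERS = {"you", "to", "the", "of", "on", "a"}
FR_MARKERS = {"vous", "pour", "l'", "la", "le", "de"}

def content_words(chunk, markers):
    """Words in a chunk that are neither tags (<...>) nor marker words."""
    return [w for w in chunk.split()
            if not w.startswith("<") and w.lower() not in markers]

def extract_lexicon(chunk_pairs):
    """Chunk pairs with exactly one content word per side yield lexicon entries."""
    return {content_words(e, EN_MARKERS)[0]: content_words(f, FR_MARKERS)[0]
            for e, f in chunk_pairs
            if len(content_words(e, EN_MARKERS)) == 1
            and len(content_words(f, FR_MARKERS)) == 1}

def generalize(chunk, markers):
    """Drop the marker words, keeping their tags, to form a marker template."""
    return " ".join(w for w in chunk.split() if w.lower() not in markers)

pairs = [("<PREP> to view", "<PREP> pour visualiser"),
         ("<DET> the effect", "<DET> l' effet")]
print(extract_lexicon(pairs))
# → {'view': 'visualiser', 'effect': 'effet'}
print([(generalize(e, EN_MARKERS), generalize(f, FR_MARKERS)) for e, f in pairs])
# → [('<PREP> view', '<PREP> visualiser'), ('<DET> effect', '<DET> effet')]
```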
Marker-Based EBMT
• During translation:
  – Resources are searched from maximal context (specific source-target sentence pairs) to minimal context (word-for-word translation)
  – Retrieved example translation candidates are recombined, along with their weights, based on source sentence order
  – System outputs an n-best list of translations
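The maximal-to-minimal back-off can be sketched as a greedy cascade. This is a hypothetical simplification: it ignores the weights, n-best recombination and fuzzy matching the real system uses, and all names are illustrative:

```python
def translate(words, sentence_db, chunk_db, lexicon):
    """Back off from an exact sentence match, to the longest matching chunk
    at each position, to word-for-word translation as a last resort."""
    sent = " ".join(words)
    if sent in sentence_db:                      # maximal context
        return sentence_db[sent]
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):       # longest chunk starting at i
            span = " ".join(words[i:j])
            if span in chunk_db:
                out.append(chunk_db[span])
                i = j
                break
        else:
            out.append(lexicon.get(words[i], words[i]))  # minimal context
            i += 1
    return " ".join(out)

chunk_db = {"john went to": "jean est allé à", "the baker's": "la boulangerie"}
print(translate("john went to the baker's".split(), {}, chunk_db, {}))
# → jean est allé à la boulangerie
```

The example reuses the fragments isolated from the toy corpus earlier in the talk.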
Phrase-Based SMT
• SMT translation and language models now make use of phrase translations in the translation model, along with word correspondences, to improve translation output
  – Better modelling of syntax and local word reordering
• Phrase extraction heuristics based on word alignments have been shown to outperform more syntactically motivated approaches [Koehn et al., 2003]:
  – Perform word alignment in both source-target and target-source directions
  – Take the intersection of the unidirectional alignments
  – Extend the intersection iteratively into the union by adding adjacent alignments within the alignment space [Och & Ney, 2003; Koehn et al., 2003]
  – Extract all possible phrases from sentence pairs which are consistent with these alignments
  – Phrase probabilities can be calculated from relative frequencies
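The symmetrization and relative-frequency steps might be sketched as follows. This is a simplified reading of the grow-diag heuristic, not a faithful reimplementation of the cited papers (which also apply 'final' steps and the full phrase-pair extraction):

```python
from collections import Counter

def symmetrize(e2f, f2e):
    """Start from the intersection of the two directional word alignments
    and iteratively add union points that neighbour (incl. diagonally) an
    accepted point and cover a so-far unaligned word."""
    aligned = set(e2f & f2e)
    union = e2f | f2e
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - aligned):
            neighbours = {(i + di, j + dj)
                          for di in (-1, 0, 1) for dj in (-1, 0, 1)} - {(i, j)}
            covers_new_word = (all(p[0] != i for p in aligned)
                               or all(p[1] != j for p in aligned))
            if neighbours & aligned and covers_new_word:
                aligned.add((i, j))
                added = True
    return aligned

def phrase_probs(phrase_pairs):
    """Relative-frequency estimate p(f|e) = count(e, f) / count(e)."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(e for e, _ in phrase_pairs)
    return {(e, f): c / src_counts[e] for (e, f), c in pair_counts.items()}
```

For instance, with `e2f = {(0,0), (1,1)}` and `f2e = {(0,0), (1,2)}`, the intersection is `{(0,0)}` and the grow step pulls in both diagonal/adjacent union points.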
Outline: Recap
• Motivations
• Example-Based Machine Translation
  – Marker-Based EBMT
• Statistical Machine Translation
• Experiments:
  – Language Pairs & Corpora Used
  – EBMT and PBSMT baseline systems
  – Hybrid System Experiments
    • Making use of merged data sets
• ‘Phrases’, ‘Chunks’ and Training-Test Corpora
• Conclusions
• Future Work
Experiments

Publication                   Training Corpus       Language Pair  Rationale
Way & Gough, NLE-05           203K-sent. Sun TM     EN-FR          How does EBMT fare compared to WB-SMT?
Groves & Way, ACL SMT-05      203K-sent. Sun TM     EN-FR          How does EBMT fare compared to PB-SMT? What about combining EBMT & SMT chunks?
Groves & Way, MT-06           322K-sent. Europarl   EN-FR          How does changing domain affect all this?
Armstrong et al., OpenLab-06  958K-sent. Europarl   ES-EN          What about a different language pair & more training data?
Stroppa et al., AMTA-06       273K-sent. EF TM      Basque-EN      What about a more different language pair?
EBMT vs. WB-SMT
• [Way & Gough, 05] (cf. talk here in May 05): on the 203K-sentence Sun TM (4.8M words), with a 4K-sentence test set (avg. sentence length 13.1 words EN, 15.2 words FR), EBMT > vanilla WB-SMT (GIZA++, CMU-Cambridge statistical toolkit, ISI ReWrite Decoder) for FR-EN
• Best BLEU scores:
  – EN→FR: .453 EBMT, .338 WB-SMT
  – FR→EN: .461 EBMT, .446 WB-SMT
EBMT & PB-SMT (on Sun TM): English-French
• The Phrase-Based system using GIZA data outperforms the same system seeded with EBMT data on all metrics, bar Precision (0.6598 vs. 0.6661)
• The Marker-Based EBMT system beats both Phrase-Based SMT systems, particularly for BLEU (0.4409 vs. 0.3758) and Recall (0.6877 vs. 0.5759)

[Chart: BLEU, Precision, Recall and WER for PBSMT (GIZA, 1.73M entries), PBSMT (EBMT, 403,278 entries) and EBMT]
EBMT & PB-SMT (on Sun TM): French-English

[Chart: BLEU, Precision, Recall and WER for PBSMT (GIZA, 1.73M entries), PBSMT (EBMT, 403,278 entries) and EBMT]

• Scores for all systems are better for FR→EN than for EN→FR
• Again, the Phrase-Based system using GIZA data outperforms the same system seeded with EBMT data
• As for EN→FR, the Marker-Based EBMT system significantly outperforms both Phrase-Based SMT systems for FR→EN
Towards Hybridity
• Decided to merge data sources
  – Combine parts of the EBMT sub-sentential alignments with parts of the data induced using GIZA++
• Performed a number of experiments using:
  – EBMT Phrases + GIZA++ Words (SEMI-HYBRID)
    • Investigate whether the quality of EBMT phrases is better than GIZA++ phrases
  – All Data (HYBRID): GIZA++ Words & Phrases + EBMT Words & Phrases
    • EBMT phrases will be used instead of SMT n-grams
    • EBMT phrases should add extra probability to ‘more useful’ SMT phrases; i.e. the probabilities of the phrases in the intersection of these two sets are boosted

[Venn diagram: overlap between EBMT phrases and GIZA++ phrases]
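One plausible realization of the HYBRID combination, sketched below, is to pool phrase-pair counts from both sources before re-estimating probabilities; pairs in the intersection then accumulate counts from each method and are boosted. This is an assumption for illustration, not necessarily the exact training procedure used:

```python
from collections import Counter

def merge_tables(giza_pairs, ebmt_pairs):
    """Pool (source, target) phrase-pair counts from both extraction methods
    and re-estimate p(target|source) by relative frequency. Pairs found by
    both methods accumulate counts from each, so their probabilities are
    boosted relative to single-source pairs."""
    counts = Counter(giza_pairs) + Counter(ebmt_pairs)
    src_totals = Counter()
    for (e, _), c in counts.items():
        src_totals[e] += c
    return {(e, f): c / src_totals[e] for (e, f), c in counts.items()}

giza = [("the effect", "l' effet"), ("the effect", "l' impact")]
ebmt = [("the effect", "l' effet")]
print(merge_tables(giza, ebmt))
# the shared pair gets probability 2/3, the GIZA-only pair 1/3
```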
Merging Data Sources: EN→FR Results
• Using EBMT phrases + GIZA words improves significantly on using EBMT data alone
• Merging all the EBMT and GIZA data improves on all metrics, most significantly for BLEU score (0.4259 vs. 0.3643 SEMI-HYBRID)
• The EBMT system still wins out for BLEU score, Recall and WER

[Chart: BLEU, Precision, Recall and WER for PBSMT (EBMT, 403,278 entries), PBSMT (GIZA, 1.7M entries), SEMI-HYBRID (430,336 entries), HYBRID (2.05M entries) and the EBMT system]
Merging Data Sources: FR→EN Results

[Chart: BLEU, Precision, Recall and WER for PBSMT (EBMT, 403,278 entries), PBSMT (GIZA, 1.7M entries), SEMI-HYBRID (430,336 entries), HYBRID (2.05M entries) and the EBMT system]

• Using EBMT phrases + GIZA words shows improvements over the PBSMT system seeded with EBMT data, but improves only on the GIZA-seeded system’s BLEU score (0.4888 vs. 0.4198)
• However, merging all data improves on both PBSMT systems on all metrics
• The EBMT system beats the Hybrid system only on Recall and WER
Results: Discussion
• PBSMT
  – Best PBSMT BLEU scores (with GIZA++ data only): 0.375 (EN→FR), 0.420 (FR→EN)
  – Seeding PBSMT with EBMT data gets good scores: for BLEU, 0.364 (EN→FR), 0.395 (FR→EN); note differences in data size (1.73M vs. 403K)
  – PBSMT loses out to the EBMT system
• Semi-Hybrid System
  – Seeding Pharaoh with SMT words and EBMT phrases improves over the baseline GIZA++-seeded system
  – Data size diminishes considerably (430K vs. 1.73M)
  – Worse results than for the EBMT system
• Fully-Hybrid System
  – Better results than for the ‘semi-hybrid’ system: EN→FR 0.426 (vs. 0.396), FR→EN 0.489 (vs. 0.427)
  – Data size increases to 2.04M phrase-table entries
  – For FR→EN, the Hybrid system beats EBMT on BLEU (0.4888 vs. 0.4611) & Precision (0.6927 vs. 0.6782); EBMT ahead for Recall & WER
EBMT & PB-SMT (on Europarl)
• [Groves & Way, 06a/b]
  1. Added SMT chunks to the EBMT system → hybrid ‘statistical EBMT’ system
  2. New domain: Europarl (FR-EN, 322K sentences) [Koehn, 05]
• Extracted training data from the designated training sets, filtering on sentence length and relative sentence length (ratio of 1.5 used)
  – Allowed us to extract high-quality training sets

  # sentence pairs   # words
  78K                1.49M
  156K               2.98M
  322K               6.12M

• For testing, randomly extracted 5,000 sentences from the Europarl common test set. Avg. sentence lengths: 20.5 words (French), 19.0 words (English)
EBMT vs. PBSMT
• Compared the performance of our Marker-Based EBMT system against that of a PB-SMT system built using:
  – Pharaoh Phrase-Based Decoder [Koehn, 04]
  – SRI LM toolkit [Stolcke, 02]
  – Refined alignment strategy [Och & Ney, 03]
• Trained on incremental data sets, tested on the 5,000-sentence test set
  – Effect of increasing training data on translation quality
• Performed translation for FR-EN
• Evaluated translation quality automatically using BLEU [Papineni et al., 02], Precision & Recall (GTM toolkit [Turian et al., 03]) and Word Error Rate (WER)
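WER, reported throughout the results that follow, is word-level Levenshtein distance (substitutions, insertions, deletions) normalized by reference length; a minimal sketch:

```python
def wer(hyp, ref):
    """Word error rate: word-level edit distance between hypothesis and
    reference, divided by the reference length."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)

print(wer("the cat", "the cat sat"))
# → 0.3333333333333333 (one missing word out of three)
```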
EBMT vs. PBSMT: French-English

[Charts: BLEU, Precision and Recall for EBMT and PBSMT at 78K, 156K and 322K training sentences]
• Doubling the amount of data improves performance across the board for both EBMT and PBSMT
• The PBSMT system clearly outperforms the EBMT system, on average achieving a 0.07 higher BLEU score
• PBSMT achieves a significantly lower WER (e.g. 68.55 vs. 82.43 for the 322K data set)
• Increasing the amount of training data results in:
  – 3-5% relative BLEU increase for PBSMT
  – 6.2% to 10.3% relative BLEU score improvement for EBMT
[Charts: WER for EBMT and PBSMT at 78K, 156K and 322K training sentences]
EBMT vs. PBSMT: English-French

[Charts: BLEU, Precision, Recall and WER for EBMT and PBSMT at 78K, 156K and 322K training sentences]
• PBSMT continues to outperform the EBMT system by some distance
  – e.g. 0.1933 vs. 0.1488 BLEU score, 0.518 vs. 0.4578 Recall for the 322K data set
• The difference between the systems is somewhat less for EN→FR than for FR→EN
  – EBMT system performance is much more consistent across both directions
  – The PBSMT system performs 2% BLEU score worse (10% relative) for EN→FR than for FR→EN
• French-English is ‘easier’
  – Fewer agreement errors and problems with boundary friction, e.g. le → the (FR→EN); the → le, la, les, l’ (EN→FR)
• EBMT scores higher for EN→FR than for FR→EN in terms of BLEU score
  – Cf. [Callison-Burch et al., 06] on BLEU for evaluating non-n-gram-based systems
Hybrid System Experiments
• Decided to merge elements of the EBMT marker-based alignments with PBSMT phrases and words induced via GIZA++
• A number of hybrid systems:
  – LEX-EBMT: Replaced the EBMT lexicon with higher-quality PBSMT word alignments, to lower WER
  – H-EBMT vs. H-PBSMT: Merged PBSMT words and phrases with EBMT data (words and phrases) and passed the resulting data to the baseline EBMT and baseline PBSMT systems
  – H-EBMT-LM: Reranked the output of the H-EBMT systems using the PBSMT system’s equivalent language model
Hybrid Experiments: French-English

[Chart: BLEU score vs. training-set size (78K, 156K, 322K) for EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT and H-PBSMT]
• Use of the improved lexicon (LEX-EBMT) leads to only slight improvements (average relative BLEU increase of 2.9%)
• Adding hybrid data improves above the baselines, for both EBMT (H-EBMT) and PBSMT (H-PBSMT)
  – The H-PBSMT system trained on 78K & 156K achieves a higher BLEU score than the PBSMT system trained on twice as much data
• The addition of the language model to the H-EBMT system helps guide word order after lexical selection and thus improves results further
Hybrid Experiments: English-French

[Chart: BLEU score vs. training-set size (78K, 156K, 322K) for EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT and H-PBSMT]
• We see similar results for EN→FR as for FR→EN
  – The more SMT-like the EBMT system becomes, the more its BLEU scores fall in line with the other metrics, i.e. higher for FR→EN than for EN→FR
• Using the hybrid data set we get a 15% average relative increase in BLEU score for the EBMT system, and 6.2% for the H-PBSMT system over its baseline
• The H-PBSMT system performs almost as well as the baseline system trained on over 4 times the amount of data
SMT ‘phrases’ vs. EBMT ‘chunks’

        SMT     EBMT      BOTH      SMT-ONLY  EBMT-ONLY
78K     1.17M   242,907   47,311    1.12M     195,596
156K    2.45M   470,588   92,662    2.36M     378,026
322K    5.15M   928,717   181,669   4.97M     747,048

• Many more SMT phrases are derived than EBMT chunks
  – Not reflected in scores
• Doubling the amount of data doubles the number of sub-sentential alignments for both systems
  – Indicates the heterogeneous nature of the Europarl corpus
• Taking the 322K training set:
  – 93.0% of SMT chunks are found only once; 99.4% occur < 10 times
  – 96.6% of EBMT chunks are found only once; 99.8% occur < 10 times
• Of the top 10 most frequent chunks in the SMT-only set, 7 are made up solely of marker words:
  du : of the
  de la : of the
  union européenne : union
  états membres : member states
  de l : of the
  dans le : in the
  n est : is
  parlement européen : parliament
  que nous : that we
  que la : that the
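The singleton statistics above come from a simple frequency profile of the extracted chunks; a minimal sketch (`frequency_profile` is an illustrative name):

```python
from collections import Counter

def frequency_profile(chunks):
    """Fraction of distinct chunks occurring exactly once (singletons) and
    fewer than 10 times, as reported for the 322K phrase tables above."""
    counts = Counter(chunks)
    n = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    under_ten = sum(1 for c in counts.values() if c < 10)
    return singletons / n, under_ten / n

print(frequency_profile(["de la", "de la", "du", "que nous"]))
# two of the three distinct chunks are singletons; all occur < 10 times
```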
Remarks
• [Groves & Way, 05] showed how an EBMT system outperforms a PBSMT system when trained on the Sun Microsystems data set
• This time around, the baseline PBSMT system achieves higher quality than all variants of the EBMT system
  – Heterogeneous Europarl vs. homogeneous Sun data
  – Chunk coverage is lower on the Europarl data set: 6% of translations produced using chunks alone (Sun) vs. 1% on Europarl
  – EBMT system considered 13 words on average for direct translation (vs. 7 for the Sun data)
• Significant improvements seen when using the higher-quality lexicon
• Improvements also seen when the LM is introduced
• The H-PBSMT system is able to outperform the baseline PBSMT system
• Further gains to be made from hybrid corpus-based approaches
  – Small overlap between chunks extracted via EBMT and SMT methods
Hybrid ‘Example-Based SMT’: The MaTrEx system
Hybrid Example-Based SMT
• [Armstrong et al., 06]: OpenLab MT-EVAL (March 06): adding EBMT chunks to a ‘vanilla Pharaoh’ PB-SMT system adds about 4 BLEU points for ES→EN
• [Stroppa et al., 06]: adding EBMT chunks to a ‘vanilla Pharaoh’ PB-SMT system adds about 5 BLEU points for Basque→EN
• Good performance in IWSLT-06
Outline: Recap
• Motivations
• Example-Based Machine Translation
  – Marker-Based EBMT
• Statistical Machine Translation
• Experiments:
  – Language Pairs & Corpora Used
  – EBMT and PBSMT baseline systems
  – Hybrid System Experiments
    • Making use of merged data sets
• ‘Phrases’, ‘Chunks’ and Training-Test Corpora
• Conclusions
• Future Work
‘Phrases’, ‘Chunks’ and Training-Test Corpora
• SMT phrases are contiguous sequences of n-grams
• Typically, EBMT performance is comparable with PB-SMT with fewer sub-sentential alignments
• As EBMT chunks are different from SMT ‘phrases’, use them if available in your PB-SMT systems (cf. OpenLab ES→EN and AMTA Basque→EN results). They:
  – Provide longer sequences of context → better translations
  – Reinforce the probability of good but infrequent SMT ‘phrases’
• As SMT ‘phrases’ are different from EBMT chunks, use them if available in your EBMT systems
• SMT ‘phrases’ are typically shorter than EBMT chunks, so more useful where training/test material is more heterogeneous: where EBMT chunks are ‘too long’ to cover the input data, SMT n-grams can fill in before we need to resort to word-for-word translation (always the last resort)
• cf. CMU findings in the recent NIST MT-Eval …
‘Phrases’, ‘Chunks’ and Training-Test Corpora
• Looks like EBMT is better on homogeneous training data:
  – EBMT > PB-SMT on Sun TM (EN-FR)
  – EBMT > PB-SMT on EF TM (Basque-EN)
• SMT is better on (more) heterogeneous data:
  – PB-SMT > EBMT on Europarl (EN-FR)
• Predictors of usefulness of approach given text type:
  – Chunk coverage
  – Amount of word-for-word translation
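The ‘chunk coverage’ predictor can be estimated directly from a chunk database and a held-out test set; a hypothetical sketch (greedy longest-match, with `chunk_db` an illustrative set of known source chunks; the real diagnostic may be defined differently):

```python
def chunk_coverage(sentences, chunk_db):
    """Fraction of test sentences fully coverable by known chunks via
    greedy longest-match. Low coverage predicts heavy word-for-word
    fallback, and hence that EBMT chunks alone will struggle."""
    covered = 0
    for words in sentences:
        i, ok = 0, True
        while i < len(words):
            for j in range(len(words), i, -1):   # longest chunk starting at i
                if " ".join(words[i:j]) in chunk_db:
                    i = j
                    break
            else:
                ok = False                       # needs word-for-word fallback
                break
        covered += ok
    return covered / len(sentences)

chunk_db = {"the cat", "sat down"}
print(chunk_coverage([["the", "cat", "sat", "down"], ["the", "dog"]], chunk_db))
# → 0.5 (the second sentence needs word-for-word fallback)
```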
Conclusions
• Combining SMT ‘phrases’ and EBMT chunks in a hybrid ‘statistical EBMT’ or ‘example-based SMT’ system will improve your system output
• Blind adherence to one approach will guarantee that your performance is less than it could otherwise be
• John Hutchins: “EBMT is Hybrid MT”
• Joe Olive: “Need combination of ‘rules’ and statistics”
Ongoing & Future Work
• Automatic detection of marker words
  – Most common SMT phrases consist mainly of marker words
• Plan to increase levels of hybridity:
  – Code a simple EBMT decoder, factoring in the Marker-Based recombination approach along with probabilities
  – Use exact sentence matching in PBSMT, as in EBMT
  – Integration of generalized templates into the PBSMT system (and reintegrate them into the EBMT system)
  – Integrate marker tag information into SMT language and translation models
  – Hybrid EBMT-EBMT system (with CMU)?!
• What’s the contribution of EBMT chunks if an SMT system is allowed as much training data as it likes?
Thank you for your attention.