

This document is part of the Project “Machine Translation Enhanced Computer Assisted Translation (MateCat)”, funded by the 7th Framework Programme of the European Commission through grant agreement no. 287688.

Machine Translation Enhanced

Computer Assisted Translation

D1.1 - First report on self-tuning MT

Authors: Holger Schwenk, Christophe Servant, Loïc Barrault, Mauro Cettolo

Dissemination level: Public

Date: December 5th, 2012


Grant agreement no.: 287688
Project acronym: MateCat
Project full title: Machine Translation Enhanced Computer Assisted Translation
Funding scheme: Collaborative project
Coordinator: Marcello Federico (FBK)
Start date, duration: November 1st, 2011, 36 months
Dissemination level: Public
Contractual date of delivery: October 31st, 2012
Actual date of delivery: December 5th, 2012
Deliverable number: D1.1
Deliverable title: First report on self-tuning MT
Type: Report
Status and version: Final, V1.2
Number of pages: 36
Contributing partners: LE MANS, FBK
WP leader: LE MANS
Task leader: LE MANS
Authors: Holger Schwenk, Christophe Servant, Loïc Barrault, Mauro Cettolo
Reviewers: Christian Buck, Marco Turchi
EC project officer: Kimmo Rossi

The partners in MateCat are:
Fondazione Bruno Kessler (FBK), Italy
Université Le Mans (LE MANS), France
The University of Edinburgh (UEDIN), UK
Translated S.r.l. (TRANSLATED), Italy

For copies of reports, updates on project activities and other MateCat-related information, contact:

MateCat
Marcello Federico
[email protected]
FBK, Povo - Via Sommarive 18
I-38123 Trento, Italy
Phone: +39 0461 314 552
Fax: +39 0461 314 591

Copies of reports and other material can also be accessed via http://www.matecat.com

© 2012, Holger Schwenk, Christophe Servant, Loïc Barrault, Mauro Cettolo No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Executive Summary

The goal of this work package is to create operational SMT systems for integration into the CAT tool and to develop various approaches to adapt these systems to the domains of interest. The creation of the initial SMT systems for four language pairs (English into French, German, Italian and Spanish) and two domains (legal and information technology) was a major effort. This included the identification, collection and cleaning of appropriate training material; definition of training, development and test subsets; and the creation of baseline and various adapted SMT systems. These systems were integrated into the operational CAT tool, in close cooperation with work package 4 (MT-CAT Integration).

We have been working on various methods to adapt generic SMT systems to the two domains. These techniques include data selection and weighting of parallel and monolingual data, various model update algorithms, mixture approaches and new statistical models operating in the continuous space. Applying all these methods, we were able to achieve significant improvements in translation quality with respect to generic SMT systems. We have also developed techniques to continuously adapt the SMT systems during the lifetime of a translation project. These approaches have been validated in laboratory and field tests.

Two papers have already been published during the first year of the project [Schwenk, 2012, Blain et al., 2012]. Results from recent research will be submitted to major conferences in the field within the next months (NAACL, ACL). Some of the research performed by the partners during the last year was not directly financed by MateCat, yet the project benefits from this research, and the corresponding techniques have been successfully applied to MateCat [Bisazza et al., 2011, Shah et al., 2012, Schwenk et al., 2012].



Contents

1 Introduction
2 Description of the tasks
   2.1 Available data resources
   2.2 Data analysis
3 Adaptation Methods
   3.1 Mono- and bilingual data selection
   3.2 Weighting parallel data (off-line)
   3.3 Weighting parallel data (on-line)
   3.4 Continuous space methods
   3.5 Fast incremental update
   3.6 Fill-up for Phrase-based SMT Adaptation
   3.7 Mixture LMs
4 Baseline Systems
   4.1 English–to–Italian, IT domain
   4.2 English–to–German, IT domain
   4.3 English–to–Spanish, IT and Legal domains
   4.4 English–to–Italian, Legal domain
   4.5 English–to–German, Legal domain
   4.6 English–to–French, Legal domain
      4.6.1 Corpus Weighting
5 Adapted Systems
   5.1 English–to–Italian, IT domain
   5.2 English–to–Italian, Legal domain and English–to–Spanish, IT/Legal domains
   5.3 English–to–German, Legal domain
      5.3.1 Lab test results



1 Introduction

The goal of this work package is to develop methods and system architectures to adapt statistical machine translation (SMT) systems. We distinguish several types of adaptation:

1. Domain adaptation: This type of adaptation encompasses all activities needed to adapt a general-purpose SMT system to a particular domain, in our case legal and information technology.

2. Project adaptation: This adaptation is performed iteratively during the lifetime of a translation project. Typically, after a day of work by a human translator, knowledge about the newly translated text is injected into the SMT systems so that they will propose improved translations the next day. This procedure is continued until the end of the project.

3. User adaptation: during the interaction between the user and the MateCat Tool, the MT engine will continuously and silently adapt to the user. Suggestions provided by the MT engine will in fact take into account already translated and approved segments, as well as corrections made by the user to previous outputs of the TM or the MT engine. In this way, the MT engine will effectively avoid replicating the same errors for the same words, segment after segment. This type of adaptation is the focus of WP2 and will be described in detail in document D2.1, due at month 22. First research on this topic is summarized in section 3.5 of this document.

During the first year of the MateCat project, we have been jointly working on various methods to achieve these goals. Some of these methods only apply to a particular type of adaptation, e.g. because they are too slow for user adaptation or because they require sufficient amounts of data to be effective, while others may potentially be applied in all three adaptation scenarios. Therefore, we first describe the algorithmic details of all the methods in section 3. Results using those methods are then presented in sections 4 and 5, respectively. Beforehand, we summarize in the next section the language pairs, tasks and available resources.

2 Description of the tasks

Although MateCat will not focus on language-dependent core technology for statistical MT, it will cover a selection of languages in order to prove the generality of the developed and tested solutions. Table 1 summarizes the language pairs and domains of interest in this project, as mentioned in the description of work (DoW).



Translation directions   Partner   Domains            Lab tests   Field test
English-to-Italian       FBK       Legal/IT           yes         yes
English-to-German        Le Mans   Legal/IT           yes         yes
English-to-French        Le Mans   Legal/IT/Medical   yes         no
English-to-Spanish       FBK       Legal/IT/Medical   yes         no

Table 1: Summary of language pairs covered by MateCat.

The systems to translate into Italian and Spanish have been developed by FBK, while the University of Le Mans addressed the translation into French and German. Work on the medical domain will be performed in the second and third year of the project in order to first focus research on generic algorithms and not to dilute efforts on basic system development.

Lab tests were performed for all language pairs and domains, i.e. comparing different variants of our systems and measuring progress on well-defined development and test sets. In addition, realistic field tests were performed on a subset of these conditions in order to measure the productivity gain of human translators when adapted SMT systems are integrated into the CAT workflow. The legal and IT domains represent relevant sectors in the translation industry, and the industrial partner of the project, Translated, has at its disposal real-life translation projects for the English-Italian and English-German language pairs. Both domains are suitable for exploiting statistical MT, i.e. the information source is sufficiently homogeneous, the language is sufficiently complex, and there is sufficient multilingual data available to train and tune MT systems.

As expected, machine translation into German is significantly more challenging than the other language pairs. In addition, it has turned out that there is a large mismatch between the projects available for the field tests and the data available to train and adapt the English-to-German system for the legal domain. Therefore, we decided to postpone the field test for this condition. This is explained in detail in the report on first lab and field tests, deliverable D5.3.

For all domains and language pairs, MateCat relies on existing language resources, including parallel corpora and translation memories. For the legal domain the publicly available JRC-Acquis collection [Steinberger et al., 2006] has been used, which includes most EU legislative texts translated into 22 languages. For the information technology domain, in addition to small publicly available corpora, proprietary data sets have been employed (software documentation in general). More details are provided in the next section.

2.1 Available data resources

In the following, we list the resources used to develop SMT systems for the four language pairs (English into French, German, Italian and Spanish) and two domains (legal and information technology).

IT domain, English–to–Italian

Most of the text corpora were provided by Translated. In particular, we employed:

• Translation Memory (TM): a large collection of parallel entries related to the IT domain built during the real use of the CAT tool; it is intended to be used for training purposes

• 4 customer projects (OPUS): parallel documents from specific customers

– 3 from KDE: KDE4, KDE4-GB (which differs from KDE4) and KDEdoc;

– 1 from PHP;

• 6 IT projects (ITprjcts): parallel documents coming from 6 real projects; in each document, segments are kept in their original order; they are intended to be used for training/development purposes

• 1 specific IT project (FLDprjct): a parallel document from the specific customer on which the field test will focus; it is intended to be used for evaluation purposes

Statistics of these corpora are reported in Table 2.

                                 #segments   #source words   #target words
TM + OPUS   all entries (wd)     5.5M        63.8M           66.6M
            no duplicates (wod)  1.9M        27.8M           29.0M
ITprjcts                         4.1k        56.0k           60.5k
    prjct0                       421         7.4k            7.5k
    prjct1                       931         11.3k           13.1k
    prjct2                       380         8.9k            9.7k
    prjct3                       289         4.0k            4.1k
    prjct4                       1184        13.2k           14.8k
    prjct5                       864         11.3k           11.3k
FLDprjct                         1789        18.0k           18.7k
    block1                       800         5.1k            5.4k
    block2                       989         12.9k           13.3k

Table 2: Overall statistics on English–Italian parallel data of the IT domain used for training and testing the SMT system. Counts of (English) source words and (Italian) target words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.


IT domain, English–to–Spanish

Most of the text corpora were provided by Translated. In particular, we employed:

• Translation Memory (TM): a large collection of parallel entries related to the IT domain built during the real use of the CAT tool; it is intended to be used for training purposes

• 6 customer projects (OPUS): parallel documents from specific customers

– 3 from KDE: KDE4, KDE4-GB (which differs from KDE4) and KDEdoc;

– 1 from PHP;

– 2 from Open Office: OpenOffice and OpenOffice3 (part not in OpenOffice).

• 5 IT projects (IT-trn-prjcts): parallel documents coming from 5 real projects; in each document, segments are kept in their original order; they are intended to be used for training purposes

• 1 specific IT project (IT-tst-prjct): a parallel document used for blind lab testing, simulating the real field test

Statistics of these corpora are reported in Table 3.

                                 #segments   #source words   #target words
TM + OPUS + IT-trn-prjcts (wod)  701k        8.3M            9.3M
IT-tst-prjct                     389         6.1k            6.1k
    block1                       200         2.6k            2.6k
    block2                       189         3.5k            3.6k

Table 3: Overall statistics on English–Spanish parallel data of the IT domain used for training and testing the SMT system. Counts of (English) source words and (Spanish) target words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.

IT domain, English–to–German

Most of the bilingual corpora were provided by Translated. For our Translation Model, we used:

• Translation Memory (TM-IT): a collection of parallel segments of the IT domain built during the use of the CAT tool by professional translators;


• 6 IT projects (ITprjcts): parallel documents coming from 6 real projects; in each document, segments are kept in their original order;

• 6 customer projects: parallel documents from specific customers

– 3 from KDE: KDE4, KDE4-GB (part not in KDE4) and KDEdoc;

– 1 from PHP;

– 2 from Open Office: OpenOffice and OpenOffice3 (part not in OpenOffice).

• 2 corpora from the WMT’12 evaluation:

– News-Commentary V.7 (NC7)

– Europarl V.7 (EP7)

The WMT’12 evaluation campaign provided us with some corpora which were used as out-of-domain data for our data selection algorithm. The description of the data is presented in Table 4.

IT domain, English–to–French

Most of the bilingual corpora were provided by Translated. For our Translation Model, we used:

• Translation Memory (TM-IT): a collection of parallel segments of the IT domain built during the use of the CAT tool by professional translators;

• 2 IT projects (ITprjcts): parallel documents coming from 2 real projects; in each document, segments are kept in their original order;

• 6 customer projects: parallel documents from specific customers

– 3 from KDE: KDE4, KDE4-GB (part not in KDE4) and KDEdoc;

– 1 from PHP;

– 2 from Open Office: OpenOffice and OpenOffice3 (part not in OpenOffice).

• 3 corpora from the WMT’12 evaluation:

– News-Commentary V.7 (NC7)

– Europarl V.7 (EP7)

– 10^9 French–English corpus (10^9)


Project       Corpus        #segments   #source words   #target words
TM-IT         all entries   147k        2.8M            2.7M
KDE           all entries   271k        2.5M            2.5M
              KDE4          217k        2M              2M
              KDE4-GB       51k         484k            452k
              KDE4doc       2.8k        38k             41k
Open Office   all entries   98k         905k            870k
              OpenOffice    41k         448k            436k
              OpenOffice3   57k         456k            433k
PHP           all entries   38k         296k            302k
ITpjs         all entries   5.5k        59k             51k
              ITpj0         1677        22k             17k
              ITpj1         1184        15k             13k
              ITpj2         2108        15k             15k
              ITpj3         112         1.1k            1.0k
              ITpj4         273         3.1k            3.0k
              ITpj5         165         1.8k            1.8k
WMT’12        all entries   2.0M        56.9M           54.5M
              NC7           158k        3.8M            3.9M
              EP7           1.9M        53.0M           50.5M

Table 4: Overall statistics on English–German parallel data of the IT domain used for training and testing the SMT system. Counts of source (English) words and target (German) words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.

The WMT’12 evaluation campaign provided us with some corpora which were used as out-of-domain data for our data selection algorithm. The description of the data is presented in Table 5.

Also, additional monolingual data was used:

• 2 corpora from the LDC French Gigaword corpus (APW and AFP)

• 1 from the WMT’12 evaluation campaign: the News Crawled Corpus (NEWS)

Their description is shown in Table 6.

Legal domain, English–to–Italian and English–to–Spanish

The JRC-Acquis collection [Steinberger et al., 2006] has been used for both training and evaluation purposes. The corpus is provided with two different alignments at segment level, computed by means of free tools. A preliminary investigation suggested re-aligning it using the Gargantua software [Braune and Fraser, 2010]. From the original corpus of each language pair,


Project       Corpus        #segments   #source words   #target words
TM-IT         all entries   150k        2.7M            3.1M
KDE           all entries   254k        2.2M            2.6M
              KDE4          206k        1.8M            2.2M
              KDE4-GB       45k         365k            415k
              KDE4doc       2k          28k             30k
Open Office   all entries   100k        899k            1.0M
              OpenOffice    40k         422k            491k
              OpenOffice3   60k         477k            527k
PHP           all entries   40k         364k            398k
ITpjs         all entries   8.9k        57k             64k
              ITpj0         1677        27k             30k
              ITpj1         1184        30k             34k
WMT’12        all entries   24M         644M            760M
              NC7           137k        3.4M            4.0M
              EP7           2M          55.7M           61.7M
              10^9          21.7M       585M            695M

Table 5: Overall statistics on English–French parallel data of the IT domain used for training and testing the SMT system. Counts of source (English) words and target (French) words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.

Project               Corpus        #segments   #target words
LDC French Gigaword   all entries   20.7M       795M
                      AFP           16M         585M
                      APW           4.7M        210M
WMT’12                NEWS          16M         409M

Table 6: Overall statistics on English–French additional monolingual data of the IT domain used for training and testing the SMT system. Counts of target-language (French) words refer to tokenized texts. The symbol M stands for 10^6.

a document has been selected for development/evaluation purposes, checking that its size was adequate for our purposes and that it belongs to a Eurovoc1 subject domain class that is neither too large nor too small. Each of these documents has been split into two blocks, one used for development, the other for evaluation. Tables 7 and 8 provide some statistics on these parallel texts.

1 http://eurovoc.europa.eu/


                                         #segments   #source words   #target words
Acquis (wod)                             1.5M        47.6M           49.3M
Legal-tst-prjct (Eurovoc class = 4040)   769         23.2k           24.2k
    block1                               290         10.0k           10.6k
    block2                               479         13.2k           13.6k

Table 7: Overall statistics on English-Italian parallel data of the Legal domain used for training and testing the SMT system. Counts of (English) source words and (Italian) target words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.

                                         #segments   #source words   #target words
Acquis (wod)                             1.6M        48.2M           53.3M
Legal-tst-prjct (Eurovoc class = 1338)   120         6.0k            7.2k
    block1                               90          3.2k            3.7k
    block2                               30          2.8k            3.5k

Table 8: Overall statistics on English-Spanish parallel data of the Legal domain used for training and testing the SMT system. Counts of (English) source words and (Spanish) target words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.

Legal domain, English–to–German

Most of the bilingual corpora were provided by Translated. For the translation model, we used:

• Translation Memory (TM-LEGAL): a collection of parallel segments of the LEGAL domain built during the use of the CAT tool by professional translators;

• 9 LEGAL projects (LEGALpjts): parallel documents coming from 9 real projects; in each document, segments are kept in their original order;

• 2 customer projects: parallel documents from specific customers

– 1 from ECB;

– 1 from Acquis;

• 2 corpora from the WMT’12 evaluation:

– News-Commentary V.7 (NC7);

– Europarl V.7 (EP7);


The WMT’12 evaluation campaign provided us with some corpora which were used as out-of-domain data for our data selection algorithm. The Acquis corpus has been realigned with the Gargantua software [Braune and Fraser, 2010]. The description of these corpora is given in Table 9.

Project    Corpus        #segments   #source words   #target words
TM-LEGAL   all entries   372k        9.8M            8.7M
Acquis     all entries   2.6M        64M             59M
ECB        all entries   110k        3.1M            2.8M
LEGALpjs   all entries   9.7k        272k            250k
           LEGALpj0      1265        26.1k           23.6k
           LEGALpj1      1391        29.3k           26.9k
           LEGALpj2      1023        29.5k           27.6k
           LEGALpj3      873         28.6k           27.2k
           LEGALpj4      780         27.5k           25.0k
           LEGALpj5      641         26.8k           24.3k
           LEGALpj6      1031        28.6k           27.2k
           LEGALpj7      1369        29.5k           28.4k
           LEGALpj8      1406        29.5k           28.4k
WMT’12     all entries   2.0M        56.9M           54.5M
           NC7           158k        3.8M            3.9M
           EP7           1.9M        53.0M           50.5M

Table 9: Overall statistics on English–German parallel data of the Legal domain used for training and testing the SMT system. Counts of source (English) words and target (German) words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.

Legal domain, English–to–French

As for the previous language pairs, most of the bilingual corpora were provided by Translated. For our translation model, we used:

• Translation Memory (TM-legal): a collection of parallel segments of the LEGAL domain built during the use of the CAT tool by professional translators;

• 9 LEGAL projects (LEGALprjcts): parallel documents coming from 9 real projects; in each document, segments are kept in their original order;

• 6 customer projects: parallel documents from specific customers

– 3 from KDE: KDE4, KDE4-GB (part not in KDE4) and KDEdoc;


– 1 from PHP;

– 2 from Open Office: OpenOffice and OpenOffice3 (part not in OpenOffice).

• 3 corpora from the WMT’12 evaluation:

– News-Commentary V.7 (NC7)

– Europarl V.7 (EP7)

– 10^9 French–English corpus (10^9)

The WMT’12 evaluation campaign provided us with some corpora which were used as out-of-domain data for our data selection algorithm. The description of these corpora is given in Table 10.

Project    Corpus        #segments   #source words   #target words
TM-LEGAL   all entries   728k        18M             20M
Acquis     all entries   2.7M        64M             70M
ECB        all entries   110k        3.2M            3.7M
LEGALpjs   all entries   9.8k        277k            306k
           LEGALpj0      1283        29.1k           31.8k
           LEGALpj1      1390        29.3k           32.4k
           LEGALpj2      1029        29.7k           32.2k
           LEGALpj3      882         28.9k           32.5k
           LEGALpj4      783         28.2k           31.5k
           LEGALpj5      642         27.1k           31.5k
           LEGALpj6      1031        28.5k           33.4k
           LEGALpj7      1381        30.0k           33.1k
           LEGALpj8      1411        45.9k           47.9k
WMT’12     all entries   24M         644M            760M
           NC7           137k        3.4M            4.0M
           EP7           2M          55.7M           61.7M
           10^9          21.7M       585M            695M

Table 10: Overall statistics on English–French parallel data of the LEGAL domain used for training and testing the SMT system. Counts of source (English) words and target (French) words refer to tokenized texts. Symbols k and M stand for 10^3 and 10^6, respectively.

As for the IT domain, additional monolingual data was used:

• 2 corpora from the LDC French Gigaword corpus (APW and AFP)

• 1 from the WMT’12 evaluation campaign: the News Crawled Corpus (NEWS)

These corpora are the same as those used for the IT domain; their description is shown in Table 6.


2.2 Data analysis

IT domain, English–to–Italian  Table 11 shows the perplexity (PP) and out-of-vocabulary rate (OOV) of the IT projects and FLDprjct, computed on two 6-gram LMs, one estimated on the whole TM target text, the other on the same text after source-target duplicates have been removed from the TM. The LMs are smoothed via the Kneser-Ney method.

             PP on TMwd   PP on TMwod   %OOV
ITprjcts     618          575           1.93
    prjct0   579          541           1.17
    prjct1   468          434           1.85
    prjct2   483          454           0.67
    prjct3   487          449           1.22
    prjct4   832          779           2.35
    prjct5   804          738           3.22
FLDprjct     151          143           0.55

Table 11: Perplexity (PP) and out-of-vocabulary rate (OOV) of project texts (target side) over 6-gram LMs estimated on the TM with and without duplicates.
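For reference, the PP and OOV figures reported here follow the standard definitions: PP is 2 raised to the average negative log2-probability per token, and OOV is the percentage of tokens outside the LM vocabulary. The sketch below is a minimal illustration with a toy probability function, not the Kneser-Ney 6-gram models actually used:

```python
import math

def pp_and_oov(tokens, vocab, logprob):
    """Perplexity and OOV rate (%) of a token stream.
    `logprob(t)` returns the LM's log2 probability for token t (OOV
    tokens receive the unknown-word probability)."""
    entropy = -sum(logprob(t) for t in tokens) / len(tokens)
    oov_rate = 100.0 * sum(t not in vocab for t in tokens) / len(tokens)
    return 2 ** entropy, oov_rate

# Toy LM: known words get probability 0.25, unknown words 0.01.
vocab = {"the", "file", "menu"}
logprob = lambda t: math.log2(0.25 if t in vocab else 0.01)
pp, oov = pp_and_oov(["the", "file", "menu", "gargantua"], vocab, logprob)
# oov is 25.0 (1 of 4 tokens is unknown)
```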

The removal of duplicates yields a 5-7% relative reduction of PP; hereinafter, all experiments will involve the TM without duplicates. Nevertheless, it can be noted that the PP of the IT projects is rather high, unlike what we observe for the FLDprjct. This is consistent with the a-priori knowledge about the contents of the TM, which is known to include many real projects committed by that customer.
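The duplicate removal applied to the TM keeps one copy of each exact source-target pair; a minimal sketch (the segment pairs are invented for the example, and the actual TM preprocessing may differ):

```python
def remove_duplicates(pairs):
    """Drop repeated (source, target) entries, keeping first occurrences;
    the same source with a different target is kept as a distinct entry."""
    seen, unique = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique

tm = [("click OK", "fare clic su OK"),
      ("click OK", "fare clic su OK"),  # exact duplicate: dropped
      ("click OK", "premere OK"),       # same source, new target: kept
      ("save the file", "salvare il file")]
tm_wod = remove_duplicates(tm)  # 3 entries remain
```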

             PP on TMwod + Σ_{i=0..5, i≠j} prjct_i   %OOV
ITprjcts     492                                     1.63
    prjct0   383                                     0.69
    prjct1   429                                     1.85
    prjct2   441                                     0.66
    prjct3   315                                     0.92
    prjct4   688                                     2.28
    prjct5   562                                     2.17
FLDprjct     142                                     0.55

Table 12: PP and OOV of project texts (target side) over 6-gram LMs estimated on the union of the TM without duplicates and the projects, computed via cross-validation.

In order to verify whether the IT projects are somehow linguistically related to each other, the PP/OOV of each of them has been computed over the TM and the other five projects, following a cross-validation scheme. Values are reported in Table 12; the overall PP/OOV of the union of the IT projects, and the PP/OOV of the FLDprjct on the concatenation of the TM and all IT projects, are provided as well. A 14-15% relative reduction of both PP and OOV indicates that the IT projects are similar to each other, while the FLDprjct seems far from them, as its PP/OOV values do not change.

             closeTMwod          farTMwod
             PP       %OOV       PP       %OOV
ITprjcts     469      2.79       1031     2.78
    prjct0   436      1.74       1117     2.56
    prjct1   352      2.41       662      2.32
    prjct2   408      3.23       949      1.44
    prjct3   360      1.69       891      2.55
    prjct4   604      2.89       1413     3.25
    prjct5   606      3.82       1218     4.03
FLDprjct     305      2.28       152      0.57

Table 13: PP and OOV of project texts (target side) over 6-gram LMs estimated on a partition of the TM without duplicates into parts close to / far from the projects. The close portion includes 5.0M words, the far portion the remaining 22.8M.

Finally, TM segments have been sorted according to their closeness to the (source side of) IT projects via the data selection method described in Section 3.1. Then the TM has been split into two parts, one including the segments closest to the IT projects, for a total of 5 million words, the other including the remaining segments. The usual target LMs have been estimated on this partition; PP/OOV values of the IT projects are shown in Table 13. It is evident that a significant portion of the TM is quite close to the IT projects: by properly selecting 5M words, the PP of the whole TM is reduced by 18% relative (from 575 to 469). The rest of the text is important at least for lexical coverage, as evidenced by the OOV increase, from 1.93% to 2.79%, when it is excluded. On the other hand, the partition does not suit the FLDprjct well, as expected.
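The partitioning step described above can be sketched as follows; the segments, scores and word budget are illustrative stand-ins for the TM segments, the cross-entropy-difference scores of Section 3.1, and the 5M-word budget:

```python
def split_by_closeness(segments, scores, word_budget):
    """Sort segments by selection score (lower = closer to the in-domain
    data) and fill the `close` partition up to `word_budget` words;
    everything else goes to the `far` partition."""
    order = sorted(range(len(segments)), key=lambda i: scores[i])
    close, far, n_words = [], [], 0
    for i in order:
        if n_words < word_budget:
            close.append(segments[i])
            n_words += len(segments[i].split())
        else:
            far.append(segments[i])
    return close, far

segments = ["open the file menu", "the court ruled today", "press the enter key"]
scores = [-1.2, 0.8, -0.5]  # hypothetical cross-entropy differences
close_part, far_part = split_by_closeness(segments, scores, word_budget=8)
```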

3 Adaptation Methods

In this section we describe all research performed on new techniques to adapt SMT systems. Many of these techniques were already used in the lab tests. We also report ongoing research on techniques which will be integrated in the second year of the project, in particular continuous space methods, on-line adaptation and fast incremental update.

Parallel and monolingual data generally comes from various sources, such as the Internet, reports from international organizations (e.g. the UN) or newspapers, or can be explicitly created for a specific project. This data is very heterogeneous with respect to many factors:

• their size: from several million words for manually created data up to several billion when automatically extracted from the Internet;

• their quality: high-quality human translations versus automatically processed data from comparable corpora or noisy data coming from forums;

• the relevance and adequacy with respect to the targeted task; this concerns:

– the domain: news, sport, medical texts, scientific texts, etc.

– the period: texts produced in the same period are generally more relevant than others (especially for news).

Despite all those differences, it is common practice to merge all these data sources to build the statistical models. Three issues arise from this:

• the models do not take into account the specific characteristics of each source of information;

• the models do not represent the target task well because in-domain data is available in smaller amounts;

• the models might be unnecessarily large since large amounts of data are injected, but a part of it may actually be irrelevant.

We have performed research on data selection and weighting during the training of statistical models for machine translation. This is described in the following sections.

3.1 Mono- and bilingual data selection

It was believed for a long time that adding more data always improves the statistical models, in particular back-off n-gram language models. This has led to research on how to estimate models on hundreds of billions of words and how to integrate them into an SMT system, for instance Brants et al. [2007]. More recently, it was discovered that we can actually do better with less data, in particular when building models for specific domains. We have explored different directions to select the most appropriate data among all the available sources. We found the idea proposed in Moore and Lewis [2010] to be the best performing approach to select data for language modeling. We have reimplemented this technique and tested it on several tasks. This tool was shared among the partners. It always led to better language models. In particular, it was used in systems built by LIUM for the OpenMT and WMT 2012 evaluations.

The starting point of this approach is a corpus which corresponds well to the task, further called the “in-domain corpus”, and another generic corpus, further called the “out-of-domain corpus”, which is in general much larger. The algorithm aims at extracting a sub-part of the generic corpus which is closer to the in-domain corpus. The first step consists in creating two language models (one in-domain and one out-of-domain) which will be used to compute a score for each sentence of the out-of-domain corpus. This score is the difference between the cross-entropy calculated with the in-domain LM and the cross-entropy calculated with the out-of-domain LM. Then, the sentences are ordered according to these scores, and those with a score under a certain threshold are kept. In order to estimate this threshold, several LMs are created using different amounts of data, and their perplexities are calculated on the development corpus. The model achieving the lowest perplexity is preserved.
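As an illustration of the scoring step, the sketch below uses add-one-smoothed unigram models in place of the real smoothed n-gram LMs; the corpora and sentences are made up for the example, but the ranking logic (cross-entropy difference, lower = closer to the in-domain data) is the one described above:

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Add-one-smoothed unigram model over a whitespace-tokenized corpus."""
    counts = Counter(corpus.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    """Per-word cross-entropy (in bits) of a sentence under the model."""
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / len(words)

in_lm = unigram_lm("click the button to open the file menu")    # in-domain (IT)
out_lm = unigram_lm("the court issued a ruling on the appeal")  # out-of-domain

def ml_score(sentence):
    # Moore-Lewis score: H_in(s) - H_out(s); lower = more in-domain.
    return cross_entropy(in_lm, sentence) - cross_entropy(out_lm, sentence)

generic_corpus = ["open the file", "the appeal was dismissed"]
ranked = sorted(generic_corpus, key=ml_score)  # most in-domain first
```

In the real setting the two LMs are higher-order smoothed models, and the kept fraction is chosen by minimizing development-set perplexity as described above.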

In all our experiments, we have observed exactly the same behavior as reported by the authors: the perplexity decreases when less, but more appropriate, data is used, reaching a minimum at about 10 to 20% of the data. As a side effect, the models are considerably smaller, which is an important aspect when deploying SMT systems in real applications.
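The threshold estimation described above (train an LM on each candidate fraction of the ranked data and keep the one with the lowest development-set perplexity) can be sketched as follows; `train_lm` and `dev_ppl` are hypothetical caller-supplied callbacks wrapping an LM toolkit and a perplexity computation:

```python
def best_selection(ranked_sentences, train_lm, dev_ppl,
                   fractions=(0.05, 0.1, 0.2, 0.4, 1.0)):
    """Try increasing fractions of the ranked data, train an LM on each,
    and keep the subset whose LM minimizes development-set perplexity.
    `train_lm(subset)` returns a model, `dev_ppl(model)` its perplexity."""
    best_frac, best_ppl, best_subset = None, float("inf"), None
    for frac in fractions:
        subset = ranked_sentences[:int(len(ranked_sentences) * frac)]
        ppl = dev_ppl(train_lm(subset))
        if ppl < best_ppl:
            best_frac, best_ppl, best_subset = frac, ppl, subset
    return best_frac, best_ppl, best_subset
```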

This idea was extended to select appropriate data in a parallel corpus [Axelrod et al., 2011], but it did not achieve better results in our experiments. So, for parallel texts, the selection is done by considering only the target side of the bitext.

3.2 Weighting parallel data (off-line)

Data selection is an effective method and it has consistently improved the translation model of our SMT systems when applied to the parallel data. It is particularly efficient for selecting appropriate sub-parts of large generic corpora like the one from the United Nations. Data selection, however, performs a binary decision: the data is kept or discarded. In addition, once we have selected the most relevant part of a large generic corpus, we should not necessarily give it the same weight as the in-domain corpus. In language modeling, it is common practice to build models on the different corpora and to perform a weighted merge using an EM procedure to optimize the coefficients. A similar technique is less straightforward for the translation model, i.e. the phrase-table.

We have investigated several techniques to perform such a weighting, using weights at the corpus and sentence level. In the first method described below, the weighting is performed during the estimation of the translation model probabilities of the phrase table. We call this off-line adaptation since it is performed before the model is used, in contrast to on-line adaptation which allows the mixture model to be changed almost instantly, without the creation of a completely new phrase-table.

During SMT model training, the probability of a target phrase $t_{ij}$ given a source phrase $s_i$ is computed as a simple relative frequency of word sequence occurrences as follows:

$$P(t_{ij} \mid s_i) = \frac{\mathrm{Count}(s_i, t_{ij})}{\sum_k \mathrm{Count}(s_i, t_{ik})} \qquad (1)$$

where $\mathrm{Count}(s, t)$ is the number of occurrences where source phrase $s$ is aligned to target phrase $t$. This formulation raises a major problem: if out-of-domain data is predominant, then the probability will be more impacted by this data than by the in-domain data, leading to a less representative model.

In order to overcome this problem, a corpus and sentence level weighting scheme has been developed [Shah et al., 2012]. The corpus weight is intended to give more importance to the in-domain corpora. However, some sentences in an in-domain corpus might be less relevant, and others coming from an out-of-domain corpus might be of greater importance. The sentence level weighting can manage the intrinsic heterogeneity of the corpora by taking into account goodness scores reflecting some characteristics of the sentence. For example, the perplexity of an in-domain language model on a sentence is a clue about the relevance of this sentence to the task. Another example is the time distance: it has been shown that data generated in the same period as the targeted task are more relevant. Consequently, it is possible (given that the information is provided) to associate with each sentence a feature corresponding to the discretized time distance between this sentence and the targeted task.

Equation 1 can then be modified as follows:

$$P(t_{ij} \mid s_i) = \frac{\sum_{c=1}^{C} w_c \, \mathrm{Count}_c(s_i, t_{ij}) \prod_{s=1}^{S} h_{c,s}(s_i, t_{ij})^{\lambda_s}}{\sum_{c=1}^{C} w_c \sum_k \mathrm{Count}_c(s_i, t_{ik}) \prod_{s=1}^{S} h_{c,s}(s_i, t_{ik})^{\lambda_s}} \qquad (2)$$

with $w_c$ being the weight of the $c$-th corpus, and $\lambda_s$ an additional parameter to weight the different sentence goodness scores $h_{c,s}(\cdot)$ among each other. In this way, any useful information can be used to weight the impact of each instance of a phrase pair on the final probability. Note that only the corpus weighting scheme was used during the field tests.
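A minimal sketch of the corpus-level part of this scheme (the variant used in the field tests, i.e. Equation 2 with the sentence goodness scores dropped) could look as follows; the data layout is our own assumption, not the actual phrase-table format:

```python
def weighted_phrase_prob(counts, weights, src, tgt):
    """Corpus-weighted relative frequency:
    P(t|s) = sum_c w_c Count_c(s,t) / sum_c w_c sum_k Count_c(s,t_k).
    `counts` maps a corpus name to a dict of (src, tgt) phrase-pair counts;
    `weights` maps a corpus name to its weight w_c."""
    num = sum(w * counts[c].get((src, tgt), 0) for c, w in weights.items())
    den = sum(w * n
              for c, w in weights.items()
              for (s, _t), n in counts[c].items() if s == src)
    return num / den if den else 0.0
```

Giving the in-domain corpus a larger weight shifts the probability mass toward translations preferred in that corpus, even when the out-of-domain counts dominate in absolute terms.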

In order to benefit from this scheme, good corpus weights must be calculated. A straightforward way is to consider only the source (or target) side of each parallel corpus, to build an LM for each one, and to calculate the optimal interpolation coefficients using the usual EM procedure to minimize the perplexity. These coefficients are then used to weight the parallel corpora. Sentence level goodness functions can be based on the alignment quality calculated by GIZA, a topic closeness score, the closeness of the sentences to a time period of interest, etc.

3.3 Weighting parallel data (on-line)

The techniques described in the previous section, as well as similar approaches proposed in the literature, are applied during the creation of the phrase-table. This means that we cannot change the weight of a particular corpus or sentence without rebuilding the full model, which prevents us from quickly adapting a system. Therefore, we have performed research on new implementations which operate on phrase tables with sufficient statistics, so that the actual calculation and weighting of the probability distributions can be performed in real time. This allows us to instantly change the weighting of the sub-models. This functionality is very important for the MateCat project.

3.4 Continuous space methods

For more than two decades, back-off n-gram language models have been the de-facto standard in language modeling for automatic speech recognition and SMT. However, the main drawback of back-off n-gram language models is the fact that the probabilities are estimated in a discrete space. This prevents any kind of interpolation in order to estimate the LM probability of an n-gram which was not observed in the training data. In order to attack this problem, it was proposed to project the words onto a continuous space and to perform the estimation task in this space. The projection as well as the estimation can be jointly performed by a multi-layer neural network [Bengio and Ducharme, 2001]. H. Schwenk from LIUM has been working for several years on continuous space language models (CSLM) for large vocabulary speech recognition [Schwenk and Gauvain, 2002, Schwenk, 2007] and statistical machine translation [Schwenk et al., 2006, Schwenk, 2010]. During the last years, continuous space language models have become a very popular approach and many variants and improvements have been proposed.
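For illustration, the forward pass of such a model (in the spirit of Bengio and Ducharme [2001], not the actual CSLM toolkit) can be sketched with randomly initialized, untrained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H, n = 1000, 32, 64, 4            # vocabulary, embedding dim, hidden units, order
E  = rng.normal(0, 0.1, (V, d))         # shared continuous projection of the words
W1 = rng.normal(0, 0.1, ((n - 1) * d, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, V));           b2 = np.zeros(V)

def next_word_distribution(context_ids):
    """Forward pass: project the n-1 context words into the continuous
    space, apply a tanh hidden layer, then a softmax over the vocabulary."""
    x = np.concatenate([E[i] for i in context_ids])   # projection layer
    h = np.tanh(x @ W1 + b1)                          # hidden layer
    logits = h @ W2 + b2
    z = np.exp(logits - logits.max())                 # numerically stable softmax
    return z / z.sum()
```

Because the projection is continuous, any context, including an n-gram never observed in training, receives a non-zero probability estimate; this is precisely the kind of interpolation a discrete back-off model cannot perform.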

LIUM is also continuing to work on continuous space methods. In particular, we have released a new version of the CSLM toolkit which integrates all the functionality needed to apply it to very large SMT tasks, like the OpenMT or WMT evaluations. This new version also includes support for Nvidia graphics cards for high performance scientific computation [Schwenk et al., 2012]. By these means, we are able to train the models on billions of words in less than 24 hours. The CSLM toolkit is freely available.2

Given the success of continuous space methods for language modeling, there have also been attempts to apply similar ideas to the translation model, namely Schwenk et al. [2007] and, more recently, Le et al. [2012]. Both approaches had been integrated into so-called bilingual tuple-based SMT systems. We have developed a new continuous space translation model that smoothly integrates into a standard phrase-based SMT system, in particular Moses. It is trained on exactly the same data, and no additional word alignments, segmentation, etc. are necessary. In machine learning, we are generally not interested in perfectly memorizing the training examples, but in learning the underlying structure of the data, and in being able to generalize well to unseen events. We have experimental evidence that the new continuous space translation model can provide meaningful probability estimations for new phrase pairs which were not seen in the training data. The architecture can also be used to provide a list of the most likely translations given an (unseen) source phrase. This work will be presented at the very selective Coling conference in December 2012 [Schwenk, 2012]. This continuous space translation model is implemented on top of the CSLM toolkit and will be freely available. The estimation of the translation probabilities in a continuous space also opens the way to new adaptation techniques. This will be investigated in the second year of MateCat.

2http://www-lium.univ-lemans.fr/~cslm

3.5 Fast incremental update

Creating an SMT system usually involves several steps, like word alignment, extraction of phrase pairs, estimation of the probability distributions, and creation of the statistical models. This overall process can take several hours or even days depending on the size of the available data. Such a long training time is not a problem per se when creating initial models for a particular domain, but it is clearly not appropriate for frequent adaptation to the project3 or even real-time reaction to user corrections.

Therefore, we have performed research on how to speed up the process of building statistical models once new data is available, without re-running the whole training pipeline. The main idea of this approach is to calculate the word-to-word alignments between the current translation hypothesis and the newly available reference translation or user correction. These alignments are combined with the source-to-hypothesis alignments, provided by Moses during the translation process, in order to infer source-to-reference alignments. This process is very fast, in contrast to the standard GIZA approach, even when the incremental version is used. The architecture of this approach is shown in Figure 1. We have also explored several options to use foreground and background phrase-tables, which circumvent the time-consuming creation of large phrase-tables. This work will be presented in December at the IWSLT conference [Blain et al., 2012].

3We focus on at least daily model updates.
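The alignment pivoting step at the core of this approach can be sketched as follows; representing alignments as sets of word-index pairs is our own simplification:

```python
def infer_src_to_ref(src_to_hyp, hyp_to_ref):
    """Pivot source-to-hypothesis and hypothesis-to-reference word
    alignments through the hypothesis: links (i, j) and (j, k) combine
    into a source-to-reference link (i, k)."""
    ref_words = {}
    for j, k in hyp_to_ref:
        ref_words.setdefault(j, set()).add(k)
    return {(i, k) for i, j in src_to_hyp for k in ref_words.get(j, ())}
```

The source-to-hypothesis links come for free from the decoder, so only the (typically short and monolingual-like) hypothesis-to-reference alignment has to be computed, which is what makes the update fast.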


Figure 1: Architecture of the fast incremental adaptation method.

3.6 Fill-up for Phrase-based SMT Adaptation

Given the scarcity of parallel linguistic resources, the need to combine diverse parallel corpora for domain-specific training is typical in SMT. The MateCat scenario is quite common: little in-domain data is available for the task, but large background models exist for the same language pair. The fill-up technique is investigated in Bisazza et al. [2011], where it is compared to interpolation methods. Fill-up effectively exploits background knowledge to improve model coverage, while preserving the more reliable information coming from the in-domain corpus. In practice, the background table is merged with the in-domain table by adding only new phrase pairs that do not appear in the in-domain table. While performing similarly to the popular log-linear and linear interpolation techniques, filled-up translation and reordering models are more compact and easy to tune by minimum error training.
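The merge itself can be sketched as follows, under the assumption that a phrase table is represented as a dict from phrase pairs to feature scores; the exp(0)/exp(1) provenance feature is one common choice (so that MERT, working on log feature values, can learn a penalty for filled-up entries), not necessarily the exact feature used here:

```python
import math

def fill_up(in_domain, background):
    """Merge a background phrase table into an in-domain one, fill-up
    style: in-domain entries are kept untouched, and background entries
    are added only for phrase pairs missing from the in-domain table.
    An extra provenance feature marks where each entry came from."""
    merged = {pair: scores + [math.exp(0)] for pair, scores in in_domain.items()}
    for pair, scores in background.items():
        if pair not in merged:
            merged[pair] = scores + [math.exp(1)]  # filled-up from background
    return merged
```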

3.7 Mixture LMs

The mixture of LMs is a well-established technique which consists of the convex combination of a set of LMs; the mixture weights are often estimated by applying the EM algorithm. The mixture model can be used to combine one or more general (background) LMs with a (foreground) LM representing new features of the language we want to include [Federico and Bertoldi, 2004]. In this case, the mixture weights can be estimated on the training data of the foreground LM by applying a cross-validation scheme that simulates the occurrence of new n-grams. Such a method is available in the IRSTLM toolkit [Federico et al., 2008] and suits the LM adaptation needs of the MateCat project.
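The EM estimation of the mixture weights can be sketched as follows, assuming the component LMs have already been applied to a held-out (or cross-validated) set of events; this is an illustrative loop, not the IRSTLM implementation:

```python
def em_mixture_weights(component_probs, n_iter=50):
    """EM for the convex-combination weights of a mixture of LMs.
    `component_probs[m][i]` is the probability component LM m assigns
    to the i-th held-out event (e.g. n-gram)."""
    M, N = len(component_probs), len(component_probs[0])
    w = [1.0 / M] * M
    for _ in range(n_iter):
        expected = [0.0] * M
        for i in range(N):
            mix = sum(w[m] * component_probs[m][i] for m in range(M))
            for m in range(M):
                expected[m] += w[m] * component_probs[m][i] / mix  # E-step
        w = [e / N for e in expected]                              # M-step
    return w
```

Each iteration raises the held-out likelihood, so the weights drift toward the component LMs that best predict the new data, which is exactly the adaptation effect exploited above.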


4 Baseline Systems

In the following sections we describe the baseline systems that were constructed for the four language pairs and the two domains each. We report comparative results for the various adaptation techniques described in Section 3. We used data selection for the translation and language model in most of our systems, as well as weighting of the parallel corpora. Continuous space methods are not yet integrated into the system since they have not yet been ported to the Moses server which is queried by the CAT tool.

4.1 English–to–Italian, IT domain

A baseline system has been built upon the open-source MT toolkit Moses [Koehn et al., 2007]. The translation and the lexicalized reordering models are trained on the available parallel training data (Table 2); a 6-gram LM smoothed with the improved Kneser-Ney technique [Chen and Goodman, 1999] is estimated on the target side via the IRSTLM toolkit [Federico et al., 2008]. The weights of the log-linear interpolation model are optimized by means of the standard MERT procedure provided within the Moses toolkit.

For the experiments, each IT project has been split into three equally sized blocks: the first is used for data selection (applied to adapted systems only, not to baselines), the second for running MERT, and the third for evaluation.

Summarizing, the main features of the baseline system are:

• single TM, reordering model (RM), LM estimated on TM wod

• MERT on the union of second blocks of IT projects.

Its automatic scores on the test set are provided in Table 14.

pair    domain    test set               BLEU     TER      METEOR    GTM
en–it   IT        IT-tst-prjct block3    23.60    55.34    40.69     57.58

Table 14: Performance of the English–to–Italian baseline system on the third block of ITprjcts.

4.2 English–to–German, IT domain

The system was trained on all the data described in Table 9. For all systems, 3 MERT runs were performed, each with a different random seed, and the average of the resulting optimized weights was used as well. Decoding was performed with these 4 sets of weights (3 from the different MERT runs plus the set consisting of the average of each optimized weight). The reported results are the average over the 4 decodings, with the standard deviation between parentheses.

Data Selection

EN–DE, IT    Bilingual          Monolingual
             EP7      NC7       NEWS
%            1        2         1
ppl          969      1218      573

Table 15: Data selection for English-German, IT domain. We give the percentage of dataselected and its perplexity on the development set.

Data selection has been performed for the language and translation model. In the case of the latter model, we performed data selection using the target side only. The out-of-domain parallel corpora available were News-Commentary v.7 (NC7) and Europarl v.7 (EP7), and the out-of-domain monolingual corpus available is the crawled news corpus (NEWS). All these corpora were distributed at the 2012 Workshop on SMT. The results of the selection for the available corpora are given in Table 15. For example, 2% of the news-commentary data was retained, giving a perplexity of 1218. In practice, a best percentage of 1 means that the percentage versus perplexity curve was monotonically increasing, presenting no local minimum.

The final size and perplexity of each language model are presented in Table 16, together with the resulting translation quality scores. Using our smallest language model is not clearly better than using the one trained on all data, but it is clearly not worse on the test data. Thus, in this task, the data selection method allowed us to reduce the LM size by a factor of 5 without any performance loss for the English–German language pair.

Data Selection    LM size (GB)    ppl     Dev BLEU        Test BLEU       Test TER        Test METEOR
all               9.7             3478    23.30 (0.10)    29.24 (0.30)    56.16 (0.10)    45.80 (0.09)
small2            1.7             3266    23.34 (0.12)    29.41 (0.28)    55.99 (0.15)    45.77 (0.28)
small             0.15            3077    23.20 (0.09)    28.79 (0.19)    56.30 (0.09)    45.20 (0.16)

Table 16: Impact of data selection for the German target language model in the IT domain.

Corpus Weighting

In this part, we evaluate the effect of two variations of the translation model, while keeping the language models as described in Table 16:

• Concatenating in-domain data with selected out-of-domain corpora (based on the selection results of Table 15).

• Corpus weighting as introduced in Section 3.2. This technique modifies the relative phrase probabilities used in the phrase table. It consists of weighting the counts of occurrences of each phrase pair depending on the corpus in which it appears. Thus, instead of simply concatenating all parallel corpora used to train the phrase table (which is equivalent to giving them equal weights), each corpus is assigned a (possibly) different weight. To estimate the corpus weights, we built a language model with each corpus, interpolated them, and took the interpolation coefficients as corpus weights.

The results are presented in Table 17. The results shown are the average over the 4 decodings with the standard deviation between parentheses, which we take as an estimate of the error of the measure. The values in bold are possibly the best ones for each MT metric, translation direction and domain, if we take the error range into account.

Most values are in bold, indicating that most differences between systems lie within the error range. However, if we consider the averages, we observe that in most cases the corpus weighting yields a small improvement. The addition of out-of-domain data also yields a small improvement in most cases.

Systems (En–De, IT)     Dev BLEU        Test BLEU       Test TER        Test METEOR     Test GTM
in-domain (in-d) only   50.31 (0.21)    23.63 (0.86)    65.62 (1.36)    43.89 (0.37)    53.07 (0.25)
in-d w                  50.13 (0.21)    23.75 (1.26)    65.04 (1.12)    43.73 (0.58)    53.00 (0.30)
in-d + ep7.1-nc7.2      50.54 (0.06)    24.62 (0.35)    64.49 (0.44)    44.09 (0.31)    53.26 (0.15)
in-d + ep7.1-nc7.2 w    50.47 (0.10)    24.20 (0.48)    64.80 (0.25)    44.49 (0.20)    53.49 (0.11)

Table 17: Impact of out-of-domain parallel data and corpus weighting to estimate phrase probabilities in the translation model (English–German IT domain). The selected out-of-domain corpora are referred to as name.f where f is the percentage of data selected. A “w” at the end of the system name indicates that corpus weighting was used.

4.3 English–to–Spanish, IT and Legal domains

As for the English–Italian pair, the baseline systems for this language pair have been built in a straightforward manner from the available training data (Tables 3 and 8). The main features are:

• single TM, RM, LM estimated on TM wod

• default Moses weights are used for the log-linear interpolation of models.

Automatic scores of both systems on the test sets are provided in Table 18.


pair    domain    test set                  BLEU     TER      METEOR    GTM
en–es   IT        IT-tst-prjct block2       64.06    24.67    41.08     81.61
en–es   Legal     Legal-tst-prjct block2    33.86    49.00    25.82     68.46

Table 18: Performance of baseline SMT systems for the translation of both IT and Legal English documents into Spanish.

4.4 English–to–Italian, Legal domain

The English–to–Italian baseline system for the Legal domain is in all respects analogous to the baselines of the English–Spanish pair described in the previous section; the training data are those reported in Table 7. Its performance is given in Table 19.

pair    domain    test set                  BLEU     TER      METEOR    GTM
en–it   Legal     Legal-tst-prjct block2    32.26    49.62    48.82     63.29

Table 19: Performance of the baseline SMT system for the translation of Legal English documents into Italian.

4.5 English–to–German, Legal domain

Data Selection

As in the previous part for the English–German IT systems, we used data selection. The out-of-domain parallel corpora available were News-Commentary v.7 (NC7) and Europarl v.7 (EP7), and the out-of-domain monolingual corpus available is the crawled news corpus (NEWS). All these corpora were distributed at the 2012 Workshop on SMT. The results of the selection for the available corpora are given in Table 20. The optimal percentage for the ECB corpus is 70% with a perplexity of 184.32, but this is actually almost identical to the perplexity of the entire ECB corpus (184.38).

Corpus Weighting

As for the IT domain, we used the corpus weighting approach. The results are presented in Table 21. According to these results, we chose the translation model in-d + ep7.7-nc7.6 w.


EN–DE, Legal    Bilingual                   Monolingual
                ECB     EP7     NC7         NEWS
%               70      7       6           3
ppl             184     191     786         422

Table 20: Data selection for English–German, Legal domain. We give the percentage of data selected and its perplexity on the development set.

Systems (En–De, Legal)    Dev BLEU        Test BLEU       Test TER        Test METEOR     Test GTM
in-domain (in-d) only     59.39 (0.05)    50.74 (0.11)    37.33 (0.14)    73.60 (0.09)    74.22 (0.03)
in-d w                    59.41 (0.10)    50.89 (0.11)    36.93 (0.31)    73.83 (0.19)    74.25 (0.21)
in-d + ep7.7-nc7.6        59.32 (0.03)    50.85 (0.10)    36.94 (0.40)    73.74 (0.27)    74.25 (0.21)
in-d + ep7.7-nc7.6 w      59.41 (0.04)    51.02 (0.13)    36.85 (0.33)    73.92 (0.17)    74.29 (0.06)

Table 21: Impact of out-of-domain parallel data and corpus weighting to estimate phrase probabilities in the translation model (English–German Legal domain). The selected out-of-domain corpora are referred to as name.f where f is the percentage of data selected. A “w” at the end of the system name indicates that corpus weighting was used.

4.6 English–to–French, Legal domain

The work carried out by LIUM on this language pair and domain used the same experimental protocol and the same approaches as described before.

Data Selection

The out-of-domain parallel corpora available were News-Commentary v.7 (NC7), Europarl v.7 (EP7), En–Fr 109 and United Nations (UN200X), and the out-of-domain monolingual corpora available are the crawled news corpus (NEWS) and the French Gigaword (AFP and APW). All these corpora were distributed at the 2012 Workshop on SMT. The results of the data selection are given in Table 22. Again, we observe that most of the ECB corpus is relevant for this task. The APW monolingual corpus, on the other hand, was discarded since only 1% was retained and the perplexity was still very high.

4.6.1 Corpus Weighting

The results for the corpus weighting approach (based on Table 22) are presented in Table 23. The best system chosen was in-d + 109.22-ep7.5-nc7.6-un.15.

We can observe that corpus weighting of the in-domain bitexts does not give any improvement in this case. The results show that a big improvement can be achieved by adding selected data to the model (almost +8 BLEU points).


EN–FR, Legal    Monolingual               Bilingual
                AFP     APW     NEWS      109     ECB     EP7     NC7     UN200X
%               3       1       4         22      97      5       6       15
ppl             711     984     191       56      197     219     893     123

Table 22: Data selection for English–French, Legal domain. We give the percentage of data selected and its perplexity on the development set.

Systems (En–Fr, Legal)             Dev BLEU        Test BLEU       Test TER        Test METEOR     Test GTM
in-domain (in-d) only              63.46 (0.17)    51.77 (0.06)    35.80 (0.25)    75.61 (0.14)    75.53 (0.05)
in-d w                             63.20 (0.16)    51.31 (0.18)    36.65 (0.30)    75.23 (0.14)    75.35 (0.10)
in-d + 109.22-ep7.5-nc7.6-un.15    71.26 (0.10)    58.95 (0.24)    30.70 (0.30)    80.61 (0.19)    79.46 (0.17)

Table 23: Impact of out-of-domain parallel data and corpus weighting to estimate phrase probabilities in the translation model. The selected out-of-domain corpora are referred to as name.f where f is the percentage of data selected. A “w” at the end of the system name indicates that corpus weighting was used.

5 Adapted Systems

5.1 English–to–Italian, IT domain

Given the provided training data, the experimental outcomes reported in Section 2.2 justify the following architectural choice of an SMT system to be used for the translation of IT projects:

• foreground (FG) models on the closest portion of the TM and on projects

• background (BG) models on the remaining part of TM

Concerning the weights of the log-linear interpolation of models, a representative development set has to be selected for running MERT. If the project(s) are not large enough to build both a proper development set (say of at least 20k words) and to train reliable FG models, the closest TM segments can be used for this purpose. Since we noted that the best ranked TM segments are too short (1 to 5 words), the TM segments to be added to the development set can be randomly chosen from an enlarged top-ranked set of longer TM segments.

Performance of Domain/Project adapted SMT systems After the baseline (Section 4.1), a system generally adapted to the IT projects has been built. Hereinafter, it is named “domain adapted” as it is intended to perform well on generic IT projects, generalizing the focus of the TM beyond the single customer. In fact, as already mentioned in Section 2.2, we know that most of the documents included in the TM come from a single customer; therefore, models built in a straightforward manner on the TM would suit texts from that customer, but generalize poorly on documents from other customers. In the following, we explain the way the six IT projects of Table 2 have been used to set up the SMT system able to perform well on generic IT texts.

For developing the domain adapted system, the FG/BG-based scheme proposed above has been followed. First, the TM has been sorted with respect to the first blocks of IT projects; then, the best ranked segments, for a total amount of around 4.5M words, have been used as FG data according to the PP computed on the second blocks of IT projects; the first blocks of IT projects have been used as FG data as well; the remaining text from the TM has been used to train BG models. The FG and BG TMs/RMs have been combined by means of the fill-up technique [Nakov, 2008, Bisazza et al., 2011]. A single 6-gram LM has been trained on the full TM. MERT has been run on a development set defined over the union of the second blocks of IT projects (about 20k words). In summary, the main features of the domain adapted system are:

domain adapted

• TM and RM: fill-up of FG and BG models

• LM on the whole TM

• MERT on the union of second blocks of IT projects

Table 24 provides automatic scores of the baseline (Section 4.1) and domain adapted systems computed on the union of the third blocks of the IT projects. The adapted system outperforms the baseline by more than 4 BLEU points, corresponding to a relative improvement of over 17%. The domain adapted system has been released for the first field test (IT domain, English–to–Italian direction) as the reference MT system, i.e. to be employed by human translators during the first day.

test set: ITprjcts-blck3    BLEU     TER      METEOR    GTM
baseline                    23.60    55.34    40.69     57.58
domain adapted              27.63    53.14    44.15     59.96

Table 24: Automatic scores of SMT systems on third blocks of ITprjcts.

Since it is known that the first field test will concern a specific customer, the scheme proposed above will be followed for building an SMT system specifically adapted to the actual project. The choice has been assessed on the FLDprjct provided by Translated. The project adapted system is similar in all respects to the domain adapted system, apart from the exploitation of FLDprjct data for defining the FG/BG partition of the TM and the development set used for MERT. First, FLDprjct has been split into two blocks: the first consisting of 5k source words, the second of the rest. The reason for such a partition is that 5k words is the expected daily productivity of a human translator; therefore, it represents the amount of parallel, project-specific data available at the end of the first day of the field test. Note that it can be assumed that the whole source text of the project is available in advance for development purposes. The TM has been sorted according to the whole source side of FLDprjct. Then, two different SMT systems have been built on two different partitions of the sorted TM, one to be used only for the estimation of the weights, the other to be used for the evaluation. They share the following features:

project adapted

• TM and RM: fill-up of FG and BG models

• LM on the whole TM

and differ in the following:

project adapted for MERT

• DEV set: 20k words randomly chosen from the best ranked 2M words of the sorted TM and the first block of FLDprjct

• FG data: 5M words from the next best ranked segments and the IT projects

• BG data: last 15M words of the sorted TM

• LM on the whole TM

• MERT on the DEV set

project adapted for EVALUATION

• FG data: 7M words from the best ranked segments of the sorted TM, the first block of FLDprjct and the IT projects

• BG data: remaining 21M words from the sorted TM

• re-used interpolation weights of the project adapted SMT system developed for MERT

Table 25 provides automatic scores of the domain adapted and Google Translate (GT) systems computed on the two blocks of FLDprjct and on their union, as well as the scores of the project adapted SMT system on the second block of FLDprjct (the scores on the first block are meaningless since it is included in the models and used for the estimation of the interpolation weights). First of all, it can be noted that the domain adapted system performs as well as Google Translate. Second, the project adapted system significantly outperforms both of them, proving the effectiveness of the approach.


FLDprjct        block1                         block2                         whole
system          BLEU   TER    MTR    GTM       BLEU   TER    MTR    GTM       BLEU   TER    MTR    GTM
dmn adapted     38.15  44.29  52.70  64.83     44.57  39.54  58.77  70.69     42.92  40.88  57.05  69.02
GT              41.31  42.04  56.51  68.39     42.32  39.58  58.41  71.67     42.04  40.27  57.87  70.73
prjct adapted   -      -      -      -         49.54  34.53  63.09  73.69     -      -      -      -

Table 25: Automatic scores of en–it SMT systems on FLDprjct. MTR stands for METEOR.

5.2 English–to–Italian, Legal domain and English–to–Spanish, IT/Legal domains

After the thorough description of the work done to set up the adaptation procedure for the translation of IT documents from English into Italian, the outcomes have been assessed on the tasks of translating English Legal documents into Italian, and of translating both IT and Legal documents from English into Spanish. The three project adapted systems have been built exploiting corpora whose statistics are provided in Tables 3, 7 and 8; they are similar to each other and resemble the en–it adapted system built for the IT domain. The scheme can be sketched as follows: from the training data, the portion closest to the source side of the document to be translated has been selected and used, together with block1 of the document, to train FG models. The remaining portion of the training data is used for BG models. Translation and reordering models are built by filling up (Section 3.6) FG with BG models; LMs are built either as a mixture (Section 3.7) of FG and BG, or by just using the concatenation of the whole target-side training data, according to the outcome of a PP-based assessment.

Concerning the interpolation weights, a strategy for their estimation is designed in Section 5.1. The project adapted system, whose weights were estimated by means of MERT, achieved a BLEU score of 49.54 on block2 of the FLDprjct (see Table 25). If the default weights provided by Moses are used, a score of 50.43 BLEU is obtained; in addition, we observed that the BLEU score on the development set obtained during the MERT iterations did not improve much over the initial value computed with the default weights. Given such experimental evidence, and with the goal of keeping the adaptation scheme as simple as possible, we decided not to run MERT and to use the default weights for the project adapted systems.

Table 26 collects the results of the three new adapted systems on block2 of the respective evaluation documents; for completeness, the scores for the IT domain/en–it pair from Table 25 and those of the baselines (Sections 4.1, 4.3 and 4.4) are also reported.

It is worth noticing that for all tasks and all metrics the project-adapted systems consistently outperform the baselines, proving the effectiveness of the proposed adaptation scheme.


task                                        baseline                     project adapted
pair   domain  test set                BLEU   TER    MTR    GTM     BLEU   TER    MTR    GTM
en–it  IT      FLDprjct block2         44.57  39.54  58.77  70.69   49.54  34.53  63.09  73.69
en–it  Legal   Legal-tst-prjct block2  32.26  49.62  48.82  63.29   33.14  47.88  50.32  64.57
en–es  IT      IT-tst-prjct block2     64.06  24.67  41.08  81.61   66.10  23.01  42.34  83.09
en–es  Legal   Legal-tst-prjct block2  33.86  49.00  25.82  68.46   35.83  47.37  26.42  69.69

Table 26: Performance of baseline and project-adapted SMT systems for the translation of Legal documents from English into Italian, and of both IT and Legal English documents into Spanish. Results for the IT domain/en–it pair from Table 25 and for the baselines (Sections 4.1, 4.3 and 4.4) are provided for completeness. MTR stands for METEOR.

5.3 English–to–German, Legal domain

In this section, we present the main experiments carried out on our baseline system in order to adapt it to the project. The experiments presented here concern the English–German system on the Legal domain.

5.3.1 Lab test results

The first experiment concerns the language model (LM). Although our LM is adapted to the domain, it is not necessarily adapted to the project. In order to simulate the project adaptation scheme, we cut the internal test set into two parts. This test set, provided by Translated, contains data coming only from translation projects (no generic data). Table 27 shows the characteristics of the corpora obtained.

Corpus                   # lines  language  # words
Part01 (prj legal dev)   4552     de        105k
                                  en        113k
Part02 (prj legal test)  5227     de        145k
                                  en        158k

Table 27: Size of the development and test corpora for the project adaptation experiments.

We used the LM and translation model adaptation approaches described before. The first part of our experiments concerns data selection for LM adaptation.

Monolingual data selection

We use the corpus prj legal dev as our in-domain data in order to perform the monolingual data selection. Table 28 shows the quantity of data selected with the cross-entropy difference. The LM obtained has a perplexity of 108.393 on the corpus prj legal dev. We named it prj legal LM.


Corpus name          TM-LEGAL  Acquis  ECB      DGTNA    NC7      EP7      NEWS
Percentage selected  18        18      15       13       7        2        1
ppl                  3006.04   125.69  11274.8  3146.54  38746.7  7357.68  4793.68

Table 28: Corpus selection using the cross-entropy difference scoring, and perplexities obtained on the prj legal dev corpus.
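The cross-entropy difference criterion behind Table 28 (the Moore–Lewis criterion) scores every candidate sentence s by H_in(s) − H_out(s), where H_in is the cross-entropy under an LM trained on the in-domain data (here prj legal dev) and H_out under an LM trained on the general pool; the lowest-scoring sentences are kept. A minimal sketch follows, assuming add-one-smoothed unigram LMs instead of the full n-gram models actually used, and with made-up toy corpora; the function names and the 50% threshold are illustrative.

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Add-one-smoothed unigram model: returns word -> log-probability."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    def logprob(word):
        return math.log((counts.get(word, 0) + 1) / (total + vocab))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-word negative log-probability of a sentence."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / max(1, len(words))

def moore_lewis_select(pool, in_domain, keep_fraction):
    """Score candidates by H_in(s) - H_out(s); lower means closer to the
    in-domain data. Keep the best-scoring fraction of the pool."""
    lm_in = unigram_lm(in_domain)
    lm_out = unigram_lm(pool)  # simplification: out-LM trained on the pool
    scored = sorted(pool, key=lambda s:
                    cross_entropy(s, lm_in) - cross_entropy(s, lm_out))
    return scored[:max(1, int(len(scored) * keep_fraction))]

# Toy pool mixing legal-like and out-of-domain sentences.
pool = ["the regulation shall enter into force",
        "the striker scored a late goal",
        "member states shall adopt the measures",
        "stock markets fell sharply today"]
in_dom = ["the directive shall apply to member states",
          "states shall adopt implementing measures"]
selected = moore_lewis_select(pool, in_dom, keep_fraction=0.5)
```

The percentages in Table 28 correspond to the per-corpus `keep_fraction` determined by thresholding these scores on each source corpus.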

The results of this adaptation are presented in Table 29. As a test of stability, we performed three optimization runs of MERT; each reported score is the average of these three runs, with the standard deviation indicated between parentheses. This first step yields an improvement of more than 11 BLEU points and 9 TER points. The translation model used is the same as in our baseline system; it is indicated as baseline in the results. We also used this new development corpus for the tuning of our system, but it did not give any improvement.
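The "average (standard deviation)" figures can be reproduced with Python's statistics module; the three per-run BLEU scores below are invented for illustration (chosen so that they average to the table's 37.08 with deviation 0.33), since the individual run scores are not given in the report.

```python
from statistics import mean, stdev

# Hypothetical BLEU scores from three independent MERT runs
# (illustrative values, not the actual experiment outputs).
runs = [36.75, 37.08, 37.41]

avg, sd = mean(runs), stdev(runs)  # sample standard deviation
print(f"{avg:.2f} ({sd:.2f})")     # prints "37.08 (0.33)"
```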

The next part of our experiments concerns the addition of specific data to the translation model.

Results on prj legal test:

System                    LM        BLEU          TER           METEOR        GTM
Baseline                  baseline  37.08 (0.33)  49.26 (0.22)  64.23 (0.46)  66.17 (0.38)
Baseline                  specific  48.68 (0.06)  40.30 (0.12)  73.94 (0.10)  72.78 (0.06)
Baseline + prj legal dev  baseline  36.84 (0.32)  51.19 (0.75)  63.35 (0.33)  65.42 (0.43)
Baseline + prj legal dev  specific  48.56 (0.07)  39.96 (0.16)  74.02 (0.03)  72.88 (0.03)

Table 29: Results of our two-step adaptation. First, the LM is adapted using the monolingual data selection method (first two lines). Then, the first part of the project data is included in the training data.

Using domain-specific bitext

The main goal of this task is to obtain a machine translation system adapted to both a domain and a task. From this point of view, we decided to add the specific data to the translation model, aiming to inject specific translations into it. First, we only use a specific LM adapted with the data selection method. Then some in-domain data are injected into the system, which is retuned. The results are shown in Table 29. We can see that adapting the LM is very important here: a gain of more than 11 BLEU points is obtained by doing so. This can be explained by the very high perplexity of the baseline LM (681.35) compared to the adapted one (361). Comparing lines 2 and 4, we observe that adding data to the translation model does not improve the BLEU score, but it slightly improves the TER score, so it is worth using it.

In the case of our lab test, all the experiments show that the main improvement comes from the adaptation of the target language model. During the field test, we applied both adaptation schemes in order to adapt the system to the project: the target LM was adapted with the same method, and specific data were added to the translation model.
