

Feasible lexical selection for rule-based machine translation

Selecció lèxica factible per a la traducció automàtica basada en regles

Francis Morton Tyers


Ph.D. thesis
Feasible lexical selection for rule-based machine translation

Tesi doctoral
Selecció lèxica factible per a la traducció automàtica basada en regles

Francis Morton Tyers

Supervised by / Dirigida per
Mikel L. Forcada
Felipe Sánchez-Martínez

May / Maig 2013


Acknowledgements

First of all, I want to thank my thesis supervisors, Mikel L. Forcada and Felipe Sánchez-Martínez. Without their help, their ideas and their constant supervision, this thesis would never have seen the light of day. Very many thanks to you both!

I also want to thank my colleagues from my time at Prompsit Language Engineering. In particular I want to thank Gema and Sergio: Gema, for taking me in during my first months in Alacant and for her constant support over these years; and Sergio, for always making me laugh and for seeing problems from another perspective. I have many fond memories of working at Prompsit, above all because of the friendly atmosphere in which we worked.

I also have many fond memories of the laboratory on the second floor of the Departament de Llenguatges i Sistemes Informàtics with my labmates: Miquel, Xavi and Víctor.

The community of the Apertium project has also given me incredible help, above all Jim, Kevin and Jacob. Talking with them, I clarified and firmed up my ideas about how lexical selection ought to be handled.

I have worked a great deal with two other groups: the Giellatekno group at the University of Tromsø (Linda and Trond), and the Ofis Publik ar Brezhoneg (Fulup). The Breton–French system that I use in this thesis would never have been possible without Fulup's help… Trugarez-vras!

I have taken part as a mentor in five editions of the Google Summer of Code; my students have been an inspiration, and I want to thank them for having taught me as much as I have taught them.

Finally, I want to thank my family for having supported me throughout the years.

Funding

I thank the Spanish Ministry of Science and Innovation for the support given to me through project TIN2009-14009-C02-01, and the Universitat d'Alacant for its support through project GRE11-20. I am also grateful to the NILS mobility project (Abel Predoc Research Grant), coordinated by the Universidad Complutense de Madrid.



Resum (Summary)

Introduction

Machine translation can be defined as the use of a computer system to translate from a source language into another, target language. This thesis focusses on the translation of written text, although machine translation can also be used to translate speech.

There are two broad approaches to machine translation: corpus-based machine translation and rule-based machine translation. The main difference between the two approaches is the type of knowledge on which the system is based. Corpus-based machine translation systems use large collections of parallel texts (parallel corpora), from which they learn to produce new translations. A parallel text is defined as a pair of texts which are translations of each other and which are aligned at sentence level. Building a parallel text from two documents which are translations of each other is by no means easy, because there is not always an exact correspondence between sentences.

The other approach to machine translation — rule-based systems — uses dictionaries and rules to carry out the translation. These resources have traditionally been created by experts who try to encode the translation process explicitly in: morphological dictionaries for the analysis and generation of morphological forms; bilingual dictionaries containing the translations between one language and the other; lexical and syntactic disambiguation rules to resolve ambiguities; and a set of structural transfer rules to adapt the syntax of one language to the other. All of these resources may contain syntactic, and even semantic, information.

Over the last fifteen years, most scientific work has focussed on the field of corpus-based machine translation. Despite this, rule-based machine translation systems are still being developed, for three main reasons:

• To build a corpus-based machine translation system that is actually useful, a parallel corpus with millions of sentence pairs is needed. Although corpora of this size exist for some language pairs, the great majority of language pairs do not enjoy this kind of resource.

• Corpus-based systems which do not include linguistic information may perform worse than rule-based systems when dealing with languages with complex morphology, or with language pairs that are very divergent in their syntax.

• A rule-based system can be developed, customised and debugged more easily than a corpus-based system. Moreover, translation errors tend to be more repetitive and predictable, which can make it easier for a translator to correct the text.

We can define two main uses of machine translation: assimilation, which aims to help the user understand a text written in a language they do not know; and dissemination, which aims to produce a draft to be revised afterwards by a translator. A translation in an assimilation setting need not be grammatically well formed: it is enough for it to be comprehensible. In a dissemination setting, on the other hand, the effort required to revise the text must be lower than that of translating from scratch.

Lexical selection

Given a word in the source language, we define lexical selection as the problem of finding, within a set of possible translations, the most adequate translation in the target language. This problem is related to that of word-sense disambiguation (Ide and Veronis, 1998), with the difference that lexical selection is an inherently bilingual problem. The objective here is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between two senses if in the end both are expressed by the same translation. When we want to translate the Catalan word estació into Portuguese, no lexical selection at all is needed, because its three main senses have the same translation: estação. However, if we want to translate it into English, we have to choose, for example, between station, season and resort.

Approaches to lexical selection

In rule-based machine translation, various approaches to the lexical-selection problem can be found. There are, for example, systems in which the lexical-selection module consists of a set of rules that use a deep analysis of the source language to decide between the possible translations in a bilingual dictionary (Bick, 2007; Han et al., 1996; Her et al., 1994). These systems usually contain thousands of lexical-selection rules, whose development requires considerable effort. To reduce this effort, the literature also contains systems that learn lexical-selection rules of the same kind from tagged corpora (Zinovjeva, 2000; Specia et al., 2005a).

On the other hand, there are systems that use corpora at translation time. For example, the system of Melero et al. (2007) first generates all the possible translations, and then uses a target-language model to choose the most probable one, as is also done in statistical machine translation. Dagan and Itai (1994) also use a statistical target-language model, but the difference is that they learn a disambiguation model in the source language.

In statistical machine translation, lexical selection is performed by combining the translation model, which provides the probabilities between words or word sequences in the source language and those in the target language, and the language model, which provides the probabilities of word sequences in the target language. There are also papers that try to include word-sense disambiguation models (Carpuat and Wu, 2007; Chan and Ng, 2007).

The objectives of this thesis

This thesis focusses on improving lexical selection in the Apertium machine translation system (Appendix A; Forcada et al. (2011)), but the lexical-selection techniques would also be valid for other translation systems that work in a similar way. The techniques must be flexible, so that they can make use of any kind of available linguistic resource. They must be efficient, that is, they must not reduce the performance of the current system. They must also be language-pair independent, so that they offer a similar improvement for languages with different linguistic typologies. Finally, it must be possible to edit the module's data by hand.

Evaluation

To evaluate the methods introduced in this thesis, we use four machine translation systems available within the free/open-source machine translation platform Apertium. The four systems translate between the language pairs Breton–French, Macedonian–English, Basque–Spanish and English–Spanish. All four pairs relate languages from different families,1 with different linguistic typologies. Language pairs with differing amounts of available resources were chosen, in order to evaluate whether the methods we describe are also useful for language pairs with few available resources. The evaluation is carried out using four parallel corpora, one for each language pair.

1The families to which the languages of the pairs belong are: Celtic (Breton), Romance (French and Spanish), Slavic (Macedonian) and Germanic (English). Basque is a language isolate.

For each method, two evaluations are carried out. The first is an intrinsic evaluation of the lexical-selection module, using the lexical-selection error rate, a measure defined in Chapter 2. The lexical-selection error rate is defined as the proportion of times that the system chooses, for an ambiguous source-language word, a translation that is not found in the target-language reference. The second evaluation is an extrinsic one that measures the quality of the final translation. For this we use the BLEU metric, which computes a score based on the degree of overlap of segments between a translation produced by the system and a reference translation.
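Stated compactly (our notation, a formalisation consistent with the definition above rather than the exact formula of Chapter 2):

\[
\mathrm{LER} = \frac{1}{|A|} \sum_{s \in A} \big[\, t(s) \notin \mathrm{ref}(s) \,\big]
\]

where A is the set of ambiguous source-language words in the test corpus, t(s) is the translation the system chooses for word s, ref(s) is the set of translations found in the reference, and [·] is 1 when its condition holds and 0 otherwise.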

A constraint-based module for lexical selection

Chapter 3 describes a formalism for lexical selection based on fixed-context constraint rules, and a module that processes them. Fixed context was chosen for reasons of efficiency and, moreover, because fixed context in the source language has been shown to improve lexical selection in statistical machine translation systems (Zens et al., 2002; Koehn et al., 2003).

The rules in this formalism have two parts: a sequence of source-language patterns and a corresponding sequence of actions. The patterns specify the source-language context, while the actions allow the user to specify lexical selections. An XML format has been defined for the rules, but the rules could equally have been written in another format, such as Constraint Grammar (Karlsson et al., 1995). The set of XML rules is not processed directly; it is compiled into a finite-state transducer. In this transducer, the input symbols form source-language patterns, and the output symbols represent operations in the target language. To apply the rules to an input sentence, an efficient dynamic-programming algorithm is used. The algorithm tries to cover the maximum number of words in the sentence using the longest rules. The motivation for using the longest rules is that the longer a rule is, the more context it has and, therefore, the more reliable it should be; a sketch of this coverage step follows.
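A minimal sketch of the dynamic programming, under simplifying assumptions: rules are literal word-sequence patterns rather than paths in the compiled transducer, scoring a k-word match as k² is one plausible way to encode the preference for longer rules, and all names are illustrative, not those of the actual module.

```python
def best_coverage(sentence, rules):
    """Cover as many words as possible, preferring longer rules.

    sentence: list of source-language words.
    rules: list of (pattern, action) pairs; pattern is a tuple of words.
    best[i] holds (score, matches) for the suffix starting at word i;
    scoring a k-word match as k*k makes one long rule outweigh any
    chain of shorter rules over the same words.
    """
    n = len(sentence)
    best = [(0, [])] * (n + 1)
    for i in range(n - 1, -1, -1):
        best[i] = best[i + 1]  # option: leave word i uncovered
        for pattern, action in rules:
            k = len(pattern)
            if tuple(sentence[i:i + k]) == pattern:
                score, matches = best[i + k]
                if k * k + score > best[i][0]:
                    best[i] = (k * k + score, [(i, pattern, action)] + matches)
    return best[0][1]

rules = [(("estació", "de", "tren"), "select station"),
         (("estació",), "select season")]
print(best_coverage("l' estació de tren".split(), rules))
# [(1, ('estació', 'de', 'tren'), 'select station')]
```

Here the three-word rule wins over the one-word rule covering the same ambiguous word, mirroring the longest-rules policy described above.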

An experiment was carried out to evaluate the module, checking whether it is adequate for letting people write lexical-selection rules that improve translation quality; how large the improvement is; and whether adding the module implies a significant reduction in translation speed. We asked four volunteers, one per language pair, to spend eight hours writing rules. None of the volunteers had seen the system before.

For each language pair, the test corpus was translated with the rules and compared with the reference corpus. Comparing the systems with rules against the systems without rules, an improvement is observed in three of the four language pairs. It should be said that, although the improvement is statistically significant, it is not very large, and the coverage of the rules is not very high. The rule set with the best coverage covers only 3% of the ambiguous words.

Figures 1 and 2 show the results for the four pairs. The systems without rules are labelled Ling, and the systems with rules are labelled Hand.

Learning rules from parallel and monolingual corpora

We have seen that hand-written rules can have a positive effect on translation quality. Even so, we do not always have people available who can write rules for our machine translation systems. If human labour is not available, then we must look elsewhere. One possible source of information is a parallel corpus, which we can view as a collection of decisions made by experts about which translation is the most adequate for a given context.

However, as already mentioned, parallel corpora are far from available for all the language pairs of the world. Another source of information can be a monolingual corpus. Monolingual corpora are available for many more languages — at least for those languages that have a writing system. If we want to be able to handle any written language and exploit all the available knowledge, we need a method that can use both monolingual corpora and parallel corpora.

This chapter describes a method for learning rules from parallel and monolingual corpora. The method is based on counting the number of times a translation of an ambiguous word is found in a given context. If we have a parallel corpus, we can word-align it and count the frequencies of source-language words aligned with translations in context in the target language. If no parallel corpus is available, we can proceed as Sánchez-Martínez et al. (2008) do: they adapt a supervised method for learning lexical disambiguation models so that it works in an unsupervised way.

Our method works in a similar way. We use the existing modules of a machine translation system to produce the possible translations corresponding to every possible sequence of lexical selections in each of the sentences of the training corpus. Then each translation is scored with a statistical model of the target language. The score given by the model is normalised by dividing the probability of each sentence by the sum of the probabilities of all the possible translations of the source-language sentence.
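In symbols (our rendering of the normalisation just described): for a source sentence s with possible translations T(s),

\[
p(t \mid s) = \frac{P_{\mathrm{LM}}(t)}{\sum_{t' \in T(s)} P_{\mathrm{LM}}(t')}
\]

where P_LM is the statistical target-language model; these normalised scores are the fractional counts used below.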


Thus, each source-language sentence has an associated set of possible translations with normalised scores, which can be viewed as fractional counts. The same algorithm is used as for supervised training, but instead of adding one for each occurrence of a translation of a source-language word in context, we add its fractional count.

Both methods have the problem that they generate many rules, and if all of them were included, translation quality would generally drop. To solve this, a threshold is introduced, defined as the frequency of the default translation — that is, the most frequent translation — in a given context, divided by the frequency of the alternative translation. The best value for the threshold is computed on a small parallel development corpus.

To evaluate the system, we run the same experiment as with the hand-written rules, comparing the systems with the corpus-derived rules against the reference systems. In addition, two further reference systems are defined, one for each type of training. For supervised training, the reference system consists of choosing the most often aligned translation (MOAT). For unsupervised training, the reference system consists of choosing the sequence of translations with the highest probability according to the statistical target-language model (TLM).

Figures 1 and 2 give a summary of the results of the methods. The supervised rule-learning method (PRul) obtains better results than the reference system in three of the four language pairs, while the unsupervised rule-learning method (MRul) obtains better results in only two of the four pairs. It should be remembered, though, that even when the unsupervised method does not beat the results of the reference system, it approaches that translation quality while exploiting only source-language information.

A method for assigning weights to the rules

Chapter 4 presented a method for learning rules from parallel and monolingual corpora. We have seen that much room for improvement remains in the results. In the best case, compared with the oracle — the best result we can hope for — 10% of the selections are still wrong, while in the worst case up to 30% of the selections are still wrong.

We know that the corpus contains the necessary information, which means that the rules-plus-threshold method does not make good use of it. In fact, we are discarding useful information at two steps. The first step is when we choose the rule set by means of the threshold. In most cases, to obtain better results, the threshold has to be set so as to discard those rules that choose a translation which is not at least, say, three times more frequent than the default translation. This means that even if an alternative translation is found in a context in which it is twice as frequent as the default translation, the corresponding rule is not included. The second step is when we apply the optimal-coverage algorithm, in which the combination of rules with the longest context is chosen: we may be discarding shorter but more reliable contexts.

The threshold described in Chapter 4 serves to choose the most probable translation in a given context. The idea is that, if nothing indicates otherwise, the translation which is most frequent in the corpus should be chosen. To decide to choose a translation which is not the most frequent one, we introduce the restriction that it must be found θ times more often than the most frequent translation.
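As a condition (one plausible formalisation of the restriction just stated): a rule selecting an alternative translation t_a over the default translation t_d in context c is kept only if

\[
\mathrm{count}(t_a, c) \ge \theta \cdot \mathrm{count}(t_d, c).
\]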

To try to exploit the information we have about alternative translations in contexts where they occur with low frequency, we introduce a probabilistic model based on the principle of maximum entropy. This principle was applied to lexical selection for statistical machine translation by Berger et al. (1996). This implies two changes to the rules-plus-threshold method. The first is that a weight must be assigned to each rule extracted from the corpus; this weight is computed from the corpus by the learning process.

The second change is made in the optimal-coverage algorithm. Instead of choosing the longest rules, we apply all the rules and, for each translation of each ambiguous source-language word, we sum the weights of every active rule that chooses a given translation. When we reach a point in the sentence where there are no more active rules, we go back and choose, for each ambiguous word, the translation with the greatest weight.
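A minimal sketch of this weighted voting, under the same simplifying assumptions as the coverage sketch above (literal patterns; rule format, names and weights are illustrative):

```python
from collections import defaultdict

def select_by_weight(sentence, ambiguous, rules):
    """For each ambiguous word, pick the translation with most rule weight.

    sentence: list of source-language words.
    ambiguous: {position: [candidate translations]}; the first candidate
    is taken to be the default translation.
    rules: list of (pattern, offset, translation, weight); offset is the
    index, within the matched pattern, of the word being disambiguated.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for pattern, offset, translation, weight in rules:
        k = len(pattern)
        for i in range(len(sentence) - k + 1):
            if tuple(sentence[i:i + k]) == pattern:
                votes[i + offset][translation] += weight
    # max() keeps the first candidate (the default) when no rule fired.
    return {pos: max(cands, key=lambda t: votes[pos].get(t, 0.0))
            for pos, cands in ambiguous.items()}

sentence = "l' estació de tren".split()
rules = [(("estació",), 0, "season", 0.9),
         (("estació", "de", "tren"), 0, "station", 2.1)]
print(select_by_weight(sentence, {1: ["station", "season"]}, rules))
# {1: 'station'}
```

Note that, unlike in the longest-rules coverage, the short rule still contributes its weight here; it is simply outvoted by the heavier contextual rule.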

When the system is trained on a parallel corpus (Figures 1 and 2: ME-P), we obtain better results for all language pairs: a relative improvement of between 3% in the worst case and 33% in the best case. When we train on a monolingual corpus (Figures 1 and 2: ME-M), in most cases (except for the Breton–French pair) we obtain worse results than those obtained with the rules-plus-threshold system. In any case, the new results are equivalent to those obtained with a target-language model, and no annotated development corpus is needed to find the threshold.

It should be said that this method has the disadvantage of generating many more rules, which implies a reduction in performance. Even so, for the Breton–French pair, which obtains the best results, the difference is not very large and, despite the larger number of rules, the system can process more than a thousand words per second.

An open question therefore remains: why does maximum-entropy learning with a monolingual corpus not work as well for all language pairs? One possible explanation is that there can be many combinations of weights that maximise the entropy, and these combinations can give different results when used as classifiers (Berger et al., 1996). It is therefore possible that a set of weights was chosen which maximises the entropy but which is not the best one for the classification task.

Conclusions

The objective of this thesis has been to incorporate a lexical-selection system into a rule-based machine translation system in order to improve translation quality. The main contributions of this work are:

• The definition of a formalism for writing lexical-selection rules based on a fixed context in the source language.

• The efficient implementation of this formalism by means of finite-state transducers, with a dynamic-programming algorithm that computes the optimal coverage of the rules for a given input sentence.

• The definition of a general method for learning rules from corpora, both parallel and monolingual.

• The definition of a method for assigning weights to these rules based on the principle of maximum entropy.

We have shown that it is feasible to improve translation quality in a statistically significant way, and in little time — just eight hours — by means of hand-written rules.

Furthermore, we have shown that it is possible to learn the same kind of rules from parallel and monolingual corpora, using a threshold to distinguish between rules that improve the translation and rules that make it worse. For a rule that chooses a translation other than the default translation to be included, the rule must have been found a minimum number of times in a given context.

A disadvantage of using this threshold is that even if a translation in context is twice as frequent as the default translation, the rule will be discarded. One solution to this problem consists of using the maximum-entropy principle to assign weights to the rules.


[Two bar charts of lexical-selection error rate (LER, %) for the systems Ling, Hand, TLM, MLT, MRul, ME-M, MOAT, PRul and ME-P; see the caption below.]

Figure 1: Summary of the lexical-selection error rate results for the English–Spanish (top) and Basque–Spanish (bottom) pairs. For an explanation of Ling, see Chapter 2. For details of TLM (target-language model), MLT (most probable translation), MRul (unsupervised rules), MOAT (most often aligned translation) and PRul (supervised rules), see Chapter 4. For details of ME-M (unsupervised rules with weights) and ME-P (supervised rules with weights), see Chapter 5.


[Two bar charts of lexical-selection error rate (LER, %) for the systems Ling, Hand, TLM, MLT, MRul, ME-M, MOAT, PRul and ME-P; see the caption below.]

Figure 2: Summary of the lexical-selection error rate results for the Macedonian–English (top) and Breton–French (bottom) pairs. For an explanation of Ling, see Chapter 2. For details of TLM (target-language model), MLT (most probable translation), MRul (unsupervised rules), MOAT (most often aligned translation) and PRul (supervised rules), see Chapter 4. For details of ME-M (unsupervised rules with weights) and ME-P (supervised rules with weights), see Chapter 5.


Preface

I first started working on machine translation in 2005, shortly after graduating from a masters in linguistics. I was enthusiastic about the possibilities of machine translation between closely-related languages, especially where one of them was a marginalised language. I remain convinced to this day that machine translation can play an important part in the preservation of linguistic diversity.

The predominant approaches to machine translation today are corpus-based approaches, which rely on bilingual texts. However, for the vast majority of the world's languages, these bilingual texts are not yet available in sufficient quantities to make a general-purpose MT system, and for a substantial proportion, sufficient text will never2 be available. For these languages, the only feasible approach is the rule-based approach.

Since I got involved in the Apertium project, I have had a hand in quite a few rule-based machine translation systems for a variety of languages, including several under-resourced languages (Macedonian, Serbo-Croatian, Afrikaans, etc.) and marginalised languages (Breton, Welsh, Aragonese, North Sámi, among others). I have also worked on corpus-based methods for several of these languages. My experience has shown me that the difference between the effort invested in making systems with the two methods is not as large as the literature may suggest, but rather that the effort is concentrated in different aspects of development. Setting out to build a corpus-based system will bring you face to face with problems of language identification, document translation identification, format cleaning, and sentence segmentation and alignment; whereas a rule-based system will require working on morphological analysis, morpho-syntactic disambiguation, parsing and contrastive grammar.

Extensive interactions with fellow machine-translation system developers have shown me that the major frustration with statistical methods is the inability to fix obvious errors in the output and deficiencies in the model, whereas the frustration with rule-based methods comes down to data entry: codifying existing knowledge in the form of dictionaries and rules. One area of knowledge codification is lexical selection — how to choose the most adequate translation in context for a word with more than one possible translation. This is something that, in corpus-based systems, largely comes for free in the form of correspondences between sequences (often called phrases) in the source and target language, but in rule-based systems needs to be explicitly codified.

2This is a bold assertion, and one for which I would be glad to be proven wrong.

This thesis focusses on the development of a module for lexical selection which, aside from addressing the core problem, also addresses these two frustrations. That is, on one hand, the workings of the module are transparent, traceable and modifiable and, on the other hand, the data required by the module can be learnt automatically from the selfsame resources that are used in corpus-based systems.

Finally, everything in this thesis is released as free/open-source software and data. The corpora and machine translation systems used in the evaluation are free/open-source, and the module developed is also free/open-source. This guarantees the reproducibility of the results, and also allows the methods described here to be improved by other researchers.

Structure of the thesis

This thesis is structured in six chapters and two appendices. The list below summarises the content of each one.

Chapter 1 begins by giving an introduction to MT and the problem of lexical selection. It describes the approaches to lexical selection which can be found in the literature.

Chapter 2 describes the evaluation setting — that is, the systems, corpora and metrics to be used in the remainder of the thesis.

Chapter 3 presents a new module for lexical selection based on rules modelled as a finite-state transducer, and an efficient algorithm for computing the best coverage of an input stream. It evaluates the module using manually written rules.

Chapter 4 describes a general method for learning rules for the module presented in the previous chapter. The method can take advantage of both monolingual and parallel corpora. The monolingual training method can be considered unsupervised, as it does not rely on any previously-annotated training data.

Chapter 5 describes a method of weighting rules based on the principle of maximum entropy. This method allows all of the rules to be taken into account, without having to choose between contradictory or overlapping rules.


Chapter 6 summarises the contributions of this thesis and suggests some future avenues for research.

Appendix A describes the Apertium platform for rule-based machine translation. Throughout the thesis this system has been used to test approaches to lexical selection.

Appendix B describes the free/open-source software released as part of this thesis.

Publications

Some parts of this thesis have been published in papers. Below is a list, along with the chapter(s) in which the content may be found.

• Tyers, F. M. and Alperen, M. S. (2010) "SETimes: A parallel corpus of Balkan languages". Proceedings of the MultiLR Workshop at the Language Resources and Evaluation Conference, LREC2010, pp. 1–5. [Chapter 2]

• Tyers, F. M. (2010) "Rule-based Breton to French machine translation". Proceedings of the 14th Annual Conference of the European Association for Machine Translation, EAMT10, pp. 174–181. [Chapter 2]

• Wiechetek, L., Tyers, F. M. and Omma, T. (2010) "Shooting at flies in the dark: Rule-based lexical selection for a minority language pair". Lecture Notes in Artificial Intelligence, Volume 6233, pp. 418–429. [Chapter 2]

• Tyers, F. M., Sánchez-Martínez, F. and Forcada, M. L. (2012) "Flexible finite-state lexical selection for rule-based machine translation". Proceedings of the 16th Annual Conference of the European Association for Machine Translation, EAMT12, pp. 213–220. [Chapters 3 and 4]

The following papers describe machine translation systems in whose development I have had a substantial hand:

• Tyers, F. M. and Donnelly, K. (2009) "apertium-cy - a collaboratively-developed free RBMT system for Welsh to English". The Prague Bulletin of Mathematical Linguistics, No. 91, pp. 57–66.

• Tyers, F. M. (2009) "Rule-based augmentation of training data in Breton–French statistical machine translation". Proceedings of the 13th Annual Conference of the European Association for Machine Translation, EAMT09, pp. 213–218.


• Tyers, F. M. and Nordfalk, J. (2009) "Shallow-transfer rule-based machine translation for Swedish to Danish". Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, pp. 27–33.

• Ginestí-Rosell, M., Ramírez-Sánchez, G., Ortiz-Rojas, S., Tyers, F. M. and Forcada, M. L. (2009) "Development of a free Basque to Spanish machine translation system". Procesamiento de Lenguaje Natural, No. 43, pp. 185–197.

• Toral, A., Ginestí-Rosell, M. and Tyers, F. M. (2011) "An Italian to Catalan RBMT system reusing data from existing language pairs". Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, pp. 77–81.

• Brandt, M. D., Loftsson, H., Sigurþórsson, H. and Tyers, F. M. (2011) "Apertium-IceNLP: A rule-based Icelandic to English machine translation system". Proceedings of the 15th Annual Conference of the European Association for Machine Translation, EAMT11, pp. 217–224.

• Otte, P. and Tyers, F. M. (2011) "Rapid rule-based machine translation between Dutch and Afrikaans". Proceedings of the 15th Annual Conference of the European Association for Machine Translation, EAMT11, pp. 153–160.

• Tyers, F. M., Washington, J. N., Salimzyanov, I. and Batalov, R. (2012) "A prototype machine translation system for Tatar and Bashkir based on free/open-source components". Proceedings of the Turkic Languages Workshop at the Language Resources and Evaluation Conference, LREC2012, pp. 11–14.

• Susanto, R. H., Larasati, S. D. and Tyers, F. M. (2012) "Rule-based machine translation between Indonesian and Malaysian". Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing at the International Conference on Computational Linguistics, COLING2012, pp. 191–200.

And the following are papers I have published on the subject of either machine translation or natural language processing:

• Tyers, F. M. and Pienaar, J. A. (2008) "Extracting bilingual word pairs from Wikipedia". Proceedings of the SALTMIL Workshop at the Language Resources and Evaluation Conference, LREC2008, pp. 19–22.


• Tyers, F. M., Wiechetek, L. and Trosterud, T. (2009) "Developing prototypes for machine translation between two Sámi languages". Proceedings of the 13th Annual Conference of the European Association for Machine Translation, EAMT09, pp. 120–128.

• Forcada, M. L., Tyers, F. M. and Ramírez-Sánchez, G. (2009) "The Apertium machine translation platform: Five years on". Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, pp. 3–10.

• Mayor, A. and Tyers, F. M. (2009) "Matxin: Moving towards language independence". Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, pp. 11–17.

• Tyers, F. M., Sánchez-Martínez, F., Ortiz-Rojas, S. and Forcada, M. L. (2010) "Free/open-source resources in the Apertium platform for machine translation research and development". The Prague Bulletin of Mathematical Linguistics, 93(1), pp. 67–76.

• Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Sánchez-Martínez, F., Ramírez-Sánchez, G. and Tyers, F. M. (2011) "Apertium: a free/open-source platform for rule-based machine translation". Machine Translation, 24(1), pp. 1–18.

• Martínez-Cortés, J. P., O'Regan, J. and Tyers, F. M. (2012) "Free/open-source shallow-transfer based machine translation for Spanish and Aragonese". Proceedings of the 8th Conference on Language Resources and Evaluation, LREC2012, pp. 2153–2157.

• Washington, J. N., Ipasov, M. and Tyers, F. M. (2012) "A finite-state morphological analyser for Kyrgyz". Proceedings of the 8th Conference on Language Resources and Evaluation, LREC2012, pp. 934–940.


Contents

Preface

1 Introduction
  1.1 Machine translation
    1.1.1 Approaches to machine translation
  1.2 Why rule-based machine translation?
  1.3 Lexical selection
    1.3.1 The size of the problem
    1.3.2 Contextual information
    1.3.3 Approaches to lexical selection
  1.4 Objectives
  1.5 Layout

2 Evaluation setting
  2.1 Language pairs
    2.1.1 Breton–French
    2.1.2 Macedonian–English
    2.1.3 Basque–Spanish
    2.1.4 English–Spanish
  2.2 Performance measures
    2.2.1 Lexical-selection error rate
    2.2.2 Bilingual evaluation understudy
  2.3 Corpora
  2.4 Reference results

3 Constraint-based lexical selection
  3.1 Formalism for lexical selection rules
  3.2 Compilation and finite-state representation
  3.3 Rule application process
  3.4 Experiments
    3.4.1 Task
    3.4.2 Results
  3.5 Discussion

4 Learning lexical-selection rules
  4.1 Overview
  4.2 Common methodology
  4.3 Supervised learning from a parallel corpus
    4.3.1 Word alignment
    4.3.2 Training
    4.3.3 Finding the rule-inclusion threshold
  4.4 Unsupervised learning from monolingual corpora
    4.4.1 Finding the rule-inclusion threshold
  4.5 Experiments
  4.6 Results
    4.6.1 Comparison with reference systems
  4.7 Discussion

5 Weighting
  5.1 Maximum-entropy lexical selection
    5.1.1 Rule application
  5.2 Experiments
  5.3 Results
  5.4 Discussion

6 Conclusions
  6.1 Summary
  6.2 Future work

A Apertium: free/open-source shallow-transfer MT
  A.1 Introduction
  A.2 Translation pipeline
    A.2.1 Deformatter
    A.2.2 Morphological analyser
    A.2.3 Morphological disambiguation
    A.2.4 Pretransfer
    A.2.5 Lexical transfer
    A.2.6 Lexical selection
    A.2.7 Structural transfer
    A.2.8 Morphological generator
    A.2.9 Post-generator
    A.2.10 Reformatter

B Software released as part of this thesis
  B.1 apertium-lex-tools

List of figures
List of abbreviations
Index of symbols
Bibliography


Chapter 1

Introduction

This thesis deals with a specific sub-problem of machine translation, namely that of lexical selection. This introductory chapter gives an overview of machine translation and a description of the problem of lexical selection, along with setting out the structure of the rest of the thesis.

1.1 Machine translation

Machine translation is the process of using a computer program to translate text or speech in one natural language into another. The vast majority of machine translation research is directed at translating text. There are many obstacles to overcome when attempting to perform translation programmatically. Arnold (2003) classifies these into four groups:

• Form does not entirely determine content. This is also called the problem of ambiguity. The problem is that many sentences in natural language can have more than one interpretation, and these interpretations may be translated differently in different languages. For example, the Catalan sentence Portaven notícies de Grècia? without further context could equally well be translated into English as 'Are they bringing news from Greece?' and 'Are they bringing news about Greece?', as the preposition de may be translated as 'from' or 'of', and may attach to either the verb portaven, as in De Grècia portaven notícies?, or the noun notícies, as in Notícies de Grècia portaven?.

• Content does not entirely determine form. In any given language there is usually more than one way to communicate any given meaning, e.g. When do we leave?, What time do we leave?, What time do we head off?

• The same content is represented differently in different languages. Languages have different ways of expressing the same meaning. For example, in the English phrase She likes elephants, She is the subject and elephants is the object, whereas in Catalan the phrase would be translated as Li agraden els elefants, with the subject and object rôles reversed.

• The translation process is difficult to describe. While translation is often an unconscious process — we translate without reflecting on how we translate — the machine does not have this unconsciousness, and must be told, or must learn, exactly what operations to perform. If these operations rely on information that the machine does not have, or cannot have, then machine translation will not be possible.

It is worth noting, however, that languages may share grammatical structures and ambiguities,1 so that even if a sentence is ambiguous, it may not be necessary to fully disambiguate it in order to make an adequate machine translation.

The uses of machine translation systems can be divided into two main groups: assimilation (or gisting), that is, to enable a user to understand what the text is about; and dissemination, that is, to help in the task of translating a text to be published. The requirements of the two groups of applications are different. Assimilation may be possible even when the text is far from being grammatically correct; however, for dissemination, the effort needed to correct (post-edit) the text must not be so high that it is preferable to translate it manually from scratch. Texts which are fine for dissemination may be completely inadequate for assimilation, and vice versa.

1.1.1 Approaches to machine translation

There are two principal types of machine translation:

• Corpus-based: uses collections of previously translated sentences to propose translations of new sentences.

• Rule-based (RBMT), also called symbolic MT: uses dictionaries and grammatical rules to convert source-language sentences into target-language sentences, often via an intermediate representation.

A brief overview of corpus-based MT would split it into two main subgroups: statistical (SMT) and example-based (EBMT). The basic approach2 to SMT (Koehn, 2010) relies on the combination of two statistical models: a translation model, estimated from a collection of previously translated sentences (a parallel corpus), and a target-language model, estimated from a collection of sentences in the target language. When estimating the translation model, word translation probabilities are calculated from the cooccurrence of tokens in the source and target sides of the parallel corpus. When calculating the target-language model, the cooccurrences are between contiguous words in the target-language sentences. The translation process proceeds by looking up the possible translations of words, combining the possibilities into a number of translation hypotheses, and then combining the probabilities of the translations from the translation model and the target-language model. A probability is calculated for each translation hypothesis, and the one with the highest probability may be selected. The first statistical machine translation systems used cooccurrences of tokens (Brown et al., 1993), but newer systems can use sequences of words called phrases (Koehn et al., 2003) or hierarchical rules (Chiang, 2007).

1These may be referred to as free rides.
2State-of-the-art phrase-based statistical machine translation systems may include a range of other statistical models and features.

In contrast, EBMT (Nagao, 1984; Carl and Way, 2003) can be thought of as translation by analogy. It still requires a parallel corpus but, instead of assigning probabilities to words, it tries to learn by example. For example, given the Spanish–Basque sentence pairs (A la chica le gustan los gatos, Katuak neskari gustatzen zaizkio) and (A la chica le gustan los elefantes, Elefanteak neskari gustatzen zaizkio), it might produce a translation template of (A la chica le gustan x, x neskari gustatzen zaizkio). When translating a new sentence, the parts are looked up and substituted. How the input sentence is segmented into parts appropriate for lookup and translation, and how the translated parts are recombined after being translated, is far from a solved problem in EBMT (Carl and Way, 2003).
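As a toy illustration of that substitution step (the template, filler table and function below are hypothetical; real EBMT systems induce templates and fillers from the corpus rather than hard-coding them):

```python
# Hypothetical single-slot template learnt from the two sentence pairs above.
SRC_PREFIX = "A la chica le gustan "
TGT_TEMPLATE = "{x} neskari gustatzen zaizkio"
FILLERS = {"los gatos": "Katuak", "los elefantes": "Elefanteak"}

def translate(sentence):
    # Match the invariant part of the template, then translate the slot.
    if sentence.startswith(SRC_PREFIX):
        x = sentence[len(SRC_PREFIX):]
        if x in FILLERS:
            return TGT_TEMPLATE.format(x=FILLERS[x])
    return None  # no template applies

print(translate("A la chica le gustan los gatos"))
# Katuak neskari gustatzen zaizkio
```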

In practice, the lines between statistical and example-based MT are not so clear. Most example-based systems have a strong statistical component (Phillips, 2011; Dandapat et al., 2010), and some statistical systems (for example Chiang (2007)) incorporate 'examples' as in the Spanish–Basque example above. Both rule-based and corpus-based methods have advantages and disadvantages. Corpus-based methods, and in particular statistical machine translation, may produce translations which sound more fluent, but the meaning may be less faithfully reproduced. Rule-based systems tend to produce translations which are less fluent, but which better preserve the source-language meaning.3

Rule-based machine translation systems use rules and dictionaries to convert text in one language into another. They generally work by using rules to parse the source text into an intermediate representation, applying transformation rules to this intermediate representation and then using a second set of rules to generate the target language (Hutchins and Somers, 1992). Depending on the level of abstraction of the intermediate representation, a system may be referred to as a transfer-based system or an interlingua system. This is demonstrated by Figure 1.1.

3This can be tested by trying out the following Basque translations on Google Translate (http://translate.google.com/), a statistical system. The first is Hau ez da irabazle baten historia. 'This is not the story of a winner' and the second Hau irabazle baten historia da. 'This is the story of a winner'. The only difference is in the position of the word da 'is' and the negative ez. The result for the first is 'This is not a history of winning.' and for the second 'This is the story of a winner.' In fact, irabazle cannot be translated as 'winning' in either of those two sentences. The translation of irabazle as 'winning' may, however, be adequate in noun-noun compounds such as zaldi irabazlea 'winning horse'. Similar examples can be found in Chen et al. (2007).

[Diagram: the Vauquois pyramid, with SL text and TL text at the base joined by direct translation, shallow transfer and deep transfer at intermediate levels, interlingua at the apex, analysis rising on the left and generation descending on the right.]

Figure 1.1: The Vauquois pyramid (Vauquois, 1968) shows the different levels of abstraction of the intermediate representation in rule-based machine translation. At the bottom of the pyramid is direct machine translation, and at the top interlingual machine translation. Between these two lie varying levels of transfer-based machine translation.

In interlingua systems, the intermediate representation is unique and entirely language-independent. There are a number of benefits to this approach, but also disadvantages. The benefit is that, in order to add a new language to an existing MT system, it is only necessary to write an analyser and a generator for the new language, and not transfer rules between the new language and all the existing languages. The drawback is that it is very hard to define an interlingua which can truly represent all nuances of all natural languages. Any unrestricted-domain interlingua would have to be able to represent all possible meanings in both real and imaginary worlds — a daunting prospect. In practice, interlingua systems are only used for limited translation domains. One example of a successful interlingual system is the KANT system, used for translating technical manuals at Caterpillar (Mitamura et al., 1991).

There can be differences in the level of abstraction of the intermediate representation used in transfer-based MT. We can distinguish two broad groups: shallow transfer (Forcada et al., 2011) and deep transfer. In shallow-transfer MT the intermediate representation is usually based on either morphology or shallow syntax. In deep-transfer MT the intermediate representation usually includes some kind of full syntactic, or even semantic, information. For translating a text for dissemination between closely-related languages such as Catalan and Occitan, shallow transfer is sufficient; but for translating for dissemination between distantly related or unrelated languages, such as Basque and English, deep transfer is necessary. For dissemination we need to have adequate syntactic transfer, which, between languages with very different word order (Basque with postpositions vs. English with prepositions), is not possible without a full syntactic parse.

[Diagram: pipeline from SL text through Analysis, SL IR, Transfer, TL IR and Generation to TL text.]

Figure 1.2: Example of how a typical transfer-based MT system works. The source text is first converted into a source-language intermediate representation (SL IR), which is then converted by the transfer module into the target-language intermediate representation (TL IR), and finally the target-language text is generated from this intermediate representation by the target-language generation module.

Transfer-based MT usually works as shown in Figure 1.2. The original text is first analysed and disambiguated morphologically (and, in the case of deep transfer, syntactically or even semantically) in order to obtain the source-language intermediate representation. The transfer process then converts this representation (still in the source language) into a representation at the same level of abstraction in the target language. From the target-language representation, the target-language text is generated. In transfer-based machine translation, rules are written on a pair-by-pair basis and are usually specific to a language pair.

1.2 Why rule-based machine translation?

Corpus-based MT has been the primary research direction in the field of machine translation in recent years. However, RBMT systems are still being developed, and there are many successful commercial and non-commercial systems.4 Some of the reasons for this are as follows:

• To be successful, corpus-based MT requires parallel corpora on the order of tens of millions of words. Although these exist for some language pairs, they exist only for a fraction of the world's languages.

• Corpus-based systems which do not incorporate any linguistic information can provide substantially worse performance when compared to rule-based systems for language pairs involving morphologically complex languages (Callison-Burch et al., 2008).

• RBMT systems can be easier to develop, customise and debug than corpus-based systems (Forcada et al., 2011).

• When building RBMT systems, linguistic knowledge for a language pair is encoded explicitly in the form of linguistic data. This makes it naturally available to build knowledge for other language pairs or even for other human language technologies, and, conversely, linguistic knowledge from other sources may be reused to build MT systems.

4 Some examples: Apertium, SYSTRAN, ProMT, Lucy Software, Gramtrans, MorfoLogic.

What is more, pairwise parallel corpora between all the world's languages are not available and are unlikely to become available. Even between related languages such as Russian and Serbo-Croatian, commercial MT systems use pivot translation (Wu and Wang, 2007; Koehn et al., 2009). This may introduce unnecessary ambiguity and produce worse translations than a straightforward direct translation. Consider the following example in Serbo-Croatian,5 as translated by a system pivoting through English:

Resorni ministar je navlačio ljude, kaže sejte biljku zelenu i čudo će da bude.
The minister of agriculture tricks the people; he says plant the green herb and there will be a miracle.

The translation provided by Google Translate into Russian, along with its approximate translation in English, is:

Соответствующий министр оказывает на людей, говорит, что зеленые растения цветы на завод и будет чудо.
The proper minister denies to the people, says, that green of plants of flowers to factory and there will be a miracle.

In this example, the Serbo-Croatian word biljka 'plant, herb' has been translated as the Russian phrase на завод 'to factory'. In English the word plant is ambiguous between a living organism and a manufacturing facility; however, in Serbo-Croatian the word biljka only has the meaning of a living organism, and in Russian the word завод only has the meaning of a manufacturing facility.

This thesis looks at shallow-transfer RBMT systems, primarily for the assimilation task between unrelated or distantly-related languages. Shallow-transfer systems are usually made up of the following resources:

• Monolingual dictionaries: Used for morphological analysis and generation.

• Morphological disambiguation: Used to resolve ambiguities where the same surface form has more than one morphological analysis (lexical form).

• Lexical transfer: Usually a bilingual dictionary containing correspondences between source-language and target-language lexical forms.6

5 From the song 'Dve žetve godišnje' by Atheist Rap.
6 The lexical form of a word is the lemma and one or more tags representing the part of speech and any morphological features.

• Structural transfer: Rules to change source-language structures into target-language structures.

The focus of this thesis is the addition of a new module for lexical selection to an existing RBMT system of this kind (Apertium; Forcada et al. (2011); Appendix A).

1.3 Lexical selection

Lexical selection is the task of choosing, for a given source-language word, the most adequate translation in the target language among a known set of alternatives. The task is related to word-sense disambiguation (Ide and Véronis, 1998). The difference from word-sense disambiguation is that lexical selection is a bilingual problem, not a monolingual one: its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all these senses result in the same final translation; however, it may be necessary to choose a different translation for the same sense, for example in a collocation. It could also be the case that a single sense has more than one possible translation (for example, synonyms). This is demonstrated in Table 1.1: when translating from Spanish to Catalan and Portuguese, it is not necessary to distinguish between the different interpretations of the word estación; when translating to Italian, it is necessary to distinguish between station or resort (a, b) and season (c); and to French, Romanian and Basque it is necessary to distinguish between all three. The difference from part-of-speech tagging is that selection between different translations is not resolved by identifying their part of speech: even if the most adequate lemma (with part-of-speech and other relevant grammatical information) is identified, there is still the need to disambiguate between different possible translations of the same lemma.

1.3.1 The size of the problem

It is very difficult to define the size of the problem of lexical selection because it is dependent on the task. The amount of lexical-selection ambiguity in translation varies depending on general variables, like the language pair or domain, and on specific variables, like the size of the bilingual dictionary of the system and the number of translation alternatives per word. However, it is possible to give a general idea of the size of the problem, and also to give some examples of how the problem of lexical selection affects assimilation and dissemination, the two tasks mentioned before.

7 This is one possible translation of 'resort'. Other possible translations are -tegi and -etxe, as in for example bainuetxea 'health resort'.

Language     a                 b          c
English      station           resort     season
Catalan      estació           estació    estació
Portuguese   estação           estação    estação
Italian      stazione          stazione   stagione
French       {gare, station}   station    saison
Romanian     {gară, stație}    stațiune   sezon
Basque       geltoki           estazio7   urtaro

Table 1.1: Examples of translations of the Spanish word estación. In more closely-related languages, there is more overlap between the sets of interpretations. More than one sense may have one translation, and a single sense may have many translations. Translations may also represent more specific interpretations; for example, French gare and Romanian gară are specifically a 'train station', and Basque estazio is used specifically for a 'ski resort'.

The proprietary general-purpose wide-coverage Danish-to-English translation system described by Bick (2007) contains a total of 107,565 SL words in Danish with 155,593 translations in English. Of the SL words, 26,872 — just under a quarter — have more than one translation, with an average of 3.1 translations per ambiguous word (Bick, p.c.). The most ambiguous word, sætte 'put, set, sit, repair, …', has 89 possible translations.

In lexical selection for dissemination purposes, the main problem for the post-editor is replacing words: every inadequate lexical selection means one more word that needs to be replaced. If the translation is inadequate, then it does not matter how inadequate or misleading it is, as it is assumed that the post-editor understands the source language. For assimilation purposes, the problem is being able to understand the translation: even if the selection is inadequate, it may still be understandable, either from the context or from semantic similarity. These differences are illustrated by the Catalan–English examples in Table 1.2.

La noua nau es troba en el poble més gran de l'àrea.
(1) The new industrial unit is found in the biggest town in the area.
(2) The new building is found in the largest village of the area.
(3) The new nave is found in the biggest town in the area.
(4) The new ship is found in the oldest village of the area.
…

Table 1.2: A sentence in Catalan with some of its possible translations. The intended translation is (1); however, in principle all the translations are valid. It could be the case that a sequence of lexical selections is useful for dissemination but not for assimilation (3), and vice versa, that a sequence may be useful for assimilation but not for dissemination (2). There is also the case where a selection is misleading, in that it makes sense in the context of the rest of the sentence but does not adequately preserve the meaning (4).

1.3.2 Contextual information

Without trying to dwell on exactly how humans perform lexical selection when translating, we can consider different types of contextual information that would be useful to a translator, and by extension to a machine translation system.

Suppose that we have the word règim in Catalan that we want to translate into English, and we are presented with two possible translations, 'diet' and 'regime'. The first information we might find useful is the surrounding words. We refer to this as lexical information, and it may take the form of collocations. If we see the words nazi, totalitari or franquista in the surrounding context, we may want to choose the translation 'regime', whereas if we see vegetarià or alimentari we may want to choose the translation 'diet'. However, clues for lexical selection do not have to be collocations of contiguous words. For example, in L'exèrcit del règim or Les forces armades del règim we are also likely to want to choose the 'regime' translation, although there are intervening words between the ambiguous word règim and the context words exèrcit and forces armades.

Morphological information. It could also be the case that morphological information is useful in determining lexical choice. For example, in Basque the noun pisu can be translated into English as 'weight', 'floor', or 'flat' (in the sense of a living place). When preceded by an ordinal, for example hirugarren 'third', the translation 'floor' is probably more adequate.

Syntactic information. The use of syntactic information, such as the relation between a head and its modifier, has also often been used. Syntactic information can be used alone, as in the case of the verb hil in Basque, which is translated into English as 'die' when used without nork? agreement,8 and 'kill' when used with nork?-nor? agreement. It can also be used together with lexical or morphological information. For example, given the English word work: if the subject of the sentence is the third-person pronoun it, or has a lemma denoting a machine (computer, car, tractor), then the more adequate translation in Catalan would be 'funcionar', whereas with other personal pronouns it would be 'treballar'.

8 In Basque, the auxiliary verbs izan and ukan can be conjugated in a number of ways: with absolutive agreement (nor?), ergative–absolutive (nork?-nor?), ergative–absolutive–dative (nork?-nor?-nori?), and absolutive–dative (nor?-nori?).

Semantic information. In some cases, semantic information may allow better generalisations for making lexical choices. If words are tagged with semantic features or semantic rôles, these can be taken into account. For example, the word know in English can be translated as either 'saber' or 'conèixer' in Catalan: if the object of the verb is inanimate, then the preferred translation may be 'saber', but if it is animate then it may be 'conèixer'.

Domain information. The domain in which a translation is made also bears very heavily on the lexical selection. For example, in a text on economics the English word bank is likely to be translated more adequately as 'banco' in Spanish, whereas in a text on fishing it might be more adequate to choose 'orilla'.

1.3.3 Approaches to lexical selection

A number of papers in the literature describe systems with manually-written rules for performing lexical selection. These can be based on different formalisms, e.g. lexical-functional grammar (Han et al., 1996; Her et al., 1994) or dependency grammar (Bick, 2007), but all have in common that they take advantage of rich source-language syntactic, and in some cases semantic, analysis in order to support the lexical-selection task.

Neither Han et al. (1996), Her et al. (1994) nor Bick (2007) provide an evaluation of their systems, although Bick (2007) gives statistics for the number of rules (in the case of the Danish–English Dan2Eng system, 17,000 rules), which easily dwarfs the number of source-language analysis rules (approx. 7,000) and syntactic transfer rules (approx. 75). According to Bick (p.c.), the lexical selection/transfer component took approximately two person-years to build. Thus, it requires substantial effort which may not be available for all language pairs.

One approach to getting over the knowledge-acquisition bottleneck is to try automatically extracting, or learning, the previously described rules from corpora. There are a number of papers which describe methods for learning rules for lexical selection from either unannotated (Yang, 1999) or annotated (Specia et al., 2005a) corpora.

Yang (1999) generates a frequency list from syntactic collocations in a large target-language corpus. This frequency list is then used to generate lexical-selection rule templates to be checked by human linguists. The method requires an existing machine translation system, a source-language parser, and a large target-language corpus. The paper states that 15,300 rules were generated for Chinese–English, but no quantitative evaluation was carried out.

Another approach is to use a previously annotated corpus to learn these rules. Specia et al. (2005a) describe an experiment to learn a number of rules in the form of a decision tree for seven ambiguous verbs in English-to-Portuguese machine translation from a sense-annotated corpus. A number of features for use in the decision tree were tested, from linguistically simple 'bags of words' to part-of-speech and syntactic relations.

Work has also been reported on using transformation-based learning (Brill, 1995) to learn rules for English-to-Swedish translation (Zinovjeva, 2000). The idea behind transformation-based learning is to start with a simple solution to the problem — in the case of part-of-speech tagging, assigning the most probable tag to each word — and then iteratively improve the solution by learning rules from an annotated corpus. In each iteration the new rules are evaluated, and the algorithm stops when new rules do not improve performance. A source-language corpus is hand-annotated with translation choices for each source-language word; then rules based on surface form, part-of-speech, and syntactic relations are learnt.

Yarowsky (1995) reports a method for bootstrapping sense disambiguation in the source language using a small amount of initial annotated training data, together with an algorithm for iteratively improving the coverage. The algorithm works by taking training examples containing polysemous words from a large monolingual corpus. From these examples, a small number representing each of the senses of each word is hand-chosen to make an initial hand-tagged training set; the remainder of the examples are left untagged. Then a supervised learning algorithm is applied to the training sets to learn a classifier for each polysemous word. These classifiers are then used on the untagged examples. Those examples which have been tagged by the classifier and receive a probability for a given sense above a certain threshold are added to the training data, and the classifier is retrained. The algorithm stops when the set of untagged examples remains stable. This could be applied to the lexical-selection problem, and was tested by Koehn and Knight (2001), who reported that it performed only slightly above the baseline (the most-frequent 'majority' translation) for their test set, and that it did not seem to be appropriate for finding strong context features for less frequent translations when there was a strong frequent translation.

One approach to target-language-based lexical selection is reported by Melero et al. (2007). Their approach is unsupervised, making use of only a source-language corpus, a part-of-speech tagger, a bilingual dictionary and a target-language model. In this system, the list of possible translations in a bilingual dictionary is first used to create a set of candidate translations for each sentence. The most promising candidates are preselected by calculating the cooccurrence probability of content words (open categories) on a target-language model. Then the highest-scoring candidates are scored on target-language models of lemmata, of lemmata plus parts-of-speech, and of parts-of-speech alone; subsequently the highest-scoring candidate is chosen. The system works in a similar way to a word-based statistical machine translation system: translations of each source-language word are looked up in the lexicon (in this case a non-probabilistic lexicon), a list of hypotheses is generated, and these are then ranked using a target-language model. An evaluation is reported on 227 sentences, with improvements over the baseline (most-frequent) translation in half of their experiments. The statistical significance of these results was not reported.

Another approach that uses target-language information at runtime is presented by Dagan and Itai (1994). They make use of syntactic parsers in both the source and target languages to extract syntactic tuples (e.g. verb–object and modifier–noun). After generating all the possible translations for a given input sentence using an ambiguous bilingual dictionary, they extract the syntactic tuples from the target language and count their frequency in a previously-trained target-language model of tuples. They use maximum-likelihood estimation to calculate the probability of a given target-language tuple being the translation of a given source-language tuple, with an automatically determined confidence threshold; the model does not make a choice if the probability is below this threshold. The authors state that they have also attempted the same system with only an unannotated n-gram language model on the target-language side, with similarly positive results.

A similar method is reported by Jian et al. (1999), who also calculate an approximate translation probability based on the order of dictionary entries. The methods described by both Dagan and Itai (1994) and Jian et al. (1999) do not use any bilingual corpus, but do require at least a source-language parser to extract syntactic relations.

Sánchez-Martínez et al. (2007) propose a method for lexical selection that uses lemma cooccurrence in the target language. The method relies on tagged and lemmatised corpora of the source and target languages, and a bilingual dictionary. Stopwords are removed from the source and target corpora. A sliding context window is used to create a TL model of coappearances of lemmas and their counts. The SL corpus is then processed with the same sliding window, and each word is looked up in the bilingual dictionary to find its translation(s). The scores from the TL model are then transferred into the SL. When performing lexical selection, for each translation of each ambiguous word, the cooccurrences of the source word with the given translation and the other SL words in the window are looked up. The counts are added up for each translation, and the highest-scoring one is chosen. No evaluation is given, but according to Sánchez-Martínez et al. (p.c.) the method does not provide an improvement over the linguist-chosen dictionary default translation.

In statistical machine translation, lexical selection is taken care of by a combination of the translation model, which provides probabilities of translations between words or word sequences (often referred to as phrases) in the source and target language, and the target-language model, which provides probabilities of word sequences in the target language. However, there has also been work on incorporating word-sense disambiguation (WSD) into statistical machine translation systems. First attempts to incorporate WSD models into SMT were unsuccessful (Carpuat and Wu, 2005), but more recent work, such as Carpuat and Wu (2007) and Chan and Ng (2007), shows that WSD models can improve translation quality. The key difference of both of these papers with respect to previous work is that they eschew manually defined senses and instead use the possible target-language translations of source-language segments as the senses.

1.4 Objectives

The main objectives of this thesis are as follows:

• To improve the treatment of the lexical-selection problem for open-category words in shallow-transfer rule-based machine translation. The reasoning behind focussing on open-category words is that, when working with machine translation for assimilation, getting adequate translations of content words is often more important in generating adequate output than getting adequate translations of function words.

• The techniques should be flexible, able to take advantage of any data available.

• The techniques should be efficient and not reduce the performance of the existing MT system.

• The method should be language-independent, providing similar improvements for languages of any typology.

• The data for the module should be able to be manually encoded or modified.

1.5 Layout

The remainder of the thesis is laid out as follows: Chapter 2 describes the experimental setting, including the corpora and evaluation metrics used in the remainder of the thesis. Chapter 3 describes an efficient rule-based module for lexical selection, and evaluates how manually-written rules can improve lexical-selection performance. Chapter 4 presents a general method for learning lexical-selection rules with both unsupervised and supervised training options. Chapter 5 describes an alternative method of lexical selection which takes advantage of the rule module for matching and selecting translations, but uses maximum-entropy training to give different weights to different rules. Chapter 6 presents conclusions and possible future research directions.

Chapter 2

Evaluation setting

This chapter describes the training and evaluation settings used in the remainder of this thesis. The primary motivation behind the evaluation is that it should be automatic, repeatable, and performed over a test set which is large enough to be representative. It should evaluate both performance on the specific subtask of lexical selection and performance on the whole translation task. Evaluating lexical-selection performance is an intrinsic module-based evaluation: it measures how well the lexical-selection module disambiguates the lexical-transfer output1 as compared to a gold-standard corpus. The whole-translation-task evaluation is an extrinsic evaluation, which tests how much the system improves as regards final translation quality in a real system.

One of the objectives of the work in this thesis is that the lexical-selection module should be as language-independent as possible. To that end, the language pairs tested should show as wide a variety of linguistic phenomena as feasible. It is also important that the methods described in this thesis be as applicable to lesser-resourced and marginalised languages as to major languages.

The chapter will begin with a short description of each of the language pairs chosen for the evaluation. The corpora to be used for training and evaluation will subsequently be described, along with the method used for annotating them. This will be followed by a description of the automatic metrics to be used in the evaluation, and the reference results using these metrics for each of the language pairs.

2.1 Language pairs

Evaluation will be performed using four language pairs. These pairs have been selected because they include languages with different morphological complexity and different amounts of available resources — although a parallel corpus is available for all pairs.

1 The lexical-transfer output is the result of looking up the translations of the source-language lexical forms in the bilingual dictionary. This is explained in more detail in Appendix A, section A.2.5.

2.1.1 Breton–French

Development of the Breton–French pair has been described in two articles (Tyers, 2009b, 2010). The bilingual dictionaries were not built with polysemy in mind from the outset, but some entries were added later to start work on lexical selection. The version used in this thesis is SVN revision 41375.2 This is the only available machine translation system between Breton and French. It has been developed part-time over a number of years.

2.1.2 Macedonian–English

The Macedonian–English pair in Apertium was created specifically for the purposes of running lexical-selection experiments. The resources reused from other pairs were the English morphological dictionary from the Icelandic–English pair (Brandt et al., 2011), the Macedonian morphological dictionary and constraint grammar from the Macedonian–Bulgarian pair (Rangelov and Tyers, 2011), and the SETimes parallel corpus (Tyers and Alperen, 2010). The work was carried out over a period of eight days, and consisted of creating the bilingual dictionary and transfer rules. The bilingual dictionary was created by tagging both sides of the parallel corpus, word-aligning them with GIZA++ and extracting the probabilistic lexicon. Entries from this lexicon were checked manually according to frequency and included in the bilingual dictionary of the machine translation system. The most probable entry was marked as the linguistic default.3 As a result of attempting to include all possible translations, the average number of translations per word is much higher than in the other pairs. The transfer rules were written by hand to treat translation problems in the corpus. The version of the software used in this thesis is SVN revision 41476.4 This is the only available rule-based machine translation system for the Macedonian–English pair. Google offers Macedonian–English translation as part of Google Translate.5

2 https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-br-fr

3 A bilingual dictionary in Apertium (Appendix A) may contain more than one possible target-language translation for a given source-language word. Where there is more than one possible translation, the person who is writing the dictionary chooses the most general or most frequent translation among the set of possible translations and marks it as the linguistic default translation.

4 https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-mk-en

5 Google Translate, http://translate.google.com/

2.1.3 Basque–Spanish

The development of the Basque-to-Spanish pair is described in Ginestí-Rosell et al. (2009). The bilingual dictionary was taken from the free/open-source Matxin system (Alegria et al., 2005). Alternative translations were included in the bilingual dictionary, but non-default translations were marked with a direction restriction.6 For the lexical-selection experiments these direction restrictions were removed. The version used in this thesis is SVN revision 41387.7 This is the only rule-based machine translation system from Basque to Spanish. Google offers Basque to Spanish as part of Google Translate.

2.1.4 English–Spanish

The English–Spanish pair was developed from a combination of the English–Catalan and Spanish–Catalan pairs over a period of around 3–5 months. The pair has been used in other lexical-selection experiments (Sánchez-Martínez et al., 2007), and contains a number of entries in the bilingual dictionary with more than one translation. The most frequent translation, as judged by the language-pair developer, is marked as the default translation, and non-default translations were added according to their frequency. The version used in this thesis is SVN revision 41387.8 There are many other systems for translation between English and Spanish, both rule-based and corpus-based; widely used examples are Google Translate and SYSTRAN.9

2.2 Performance measures

• Lexical-selection performance. This is an intrinsic module-based evaluation of the performance of the lexical-selection module. It measures how well the lexical-selection module disambiguates the output of the lexical-transfer module as compared to a gold-standard corpus. For this task, we define a metric, the lexical-selection error rate (LER); see section 2.2.1.

• Machine translation performance. This is an extrinsic evaluation, which ideally would test how much the system improves as regards an approximate measurement of final translation quality in a real system. For this task, we use the widely-used BLEU metric (see section 2.2.2).

6 The bilingual dictionaries in Apertium are bidirectional: they contain entries which can be used for translating both from language A to language B and vice versa. A direction restriction is used to limit an entry to being used in one translation direction.

7 https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-eu-es

8 https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-en-es

9 http://www.systran.fr

This is not ideal for evaluating the task of a lexical-selection module, as the performance of the module will depend greatly on (a) the coverage of the bilingual dictionaries of the RBMT system in question, and (b) the number of reference translations. It is also worth noting that successful lexical selections may not lead to successful translations due to inadequate transfer of morphological features. For example, if we translate the Spanish phrase La gente le dice que no venga 'The people tell him not to come', where decir can be translated as 'say' or 'tell' in English, and our machine translation system generates 'The people tells him not to come', it would be counted the same as if it were 'The people says him not to come', although the lexical selection made was more adequate. Generation errors — where a word is found in the bilingual dictionary, but not in the target-language morphological dictionary — may also lead to the same problem. The BLEU metric is included only because it is commonly used to evaluate MT systems. Other metrics with performance similar to BLEU, such as METEOR (Lavie and Denkowski, 2009) and NIST (Doddington, 2002), are not included, as we have chosen to focus on lexical-selection performance.

The main difference between the evaluation measures is that the lexical-selection error rate zeros in on the problem of lexical selection by restricting the evaluation to this feature; other features of the MT system, such as the transfer rules and morphological generation, are not taken into account. Confidence intervals for both of the metrics will be calculated through the bootstrap resampling method (Efron and Tibshirani, 1994), as described by Koehn (2004) and Zhang and Vogel (2004).10 In all cases, bootstrap resampling will be carried out for 1,000 iterations. Where the p = 0.95 confidence intervals overlap, we will also perform pairwise bootstrap resampling.
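This resampling procedure is simple enough to sketch in a few lines of Python. The following is an illustrative sketch, not the implementation used in the thesis; the score argument stands for any corpus-level metric (LER or BLEU) supplied by the caller.

import random

def bootstrap_interval(test_set, score, iterations=1000, confidence=0.95):
    """Bootstrap confidence interval for a corpus-level metric:
    resample the test set with replacement, score each sample,
    sort the scores and trim the tails."""
    scores = sorted(score([random.choice(test_set) for _ in test_set])
                    for _ in range(iterations))
    trim = int(iterations * (1.0 - confidence) / 2.0)  # 25 of 1,000 for p = 0.95
    return scores[trim], scores[iterations - trim - 1]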

10 Broadly speaking, this involves iteratively taking a random sample (with replacement) from the test set and computing the score for that sample. The scores are sorted and, presuming we want the 95% confidence interval, the top and bottom 2.5% are removed. The highest and lowest remaining scores are given as the interval. For more detail, refer to the references provided.

2.2.1 Lexical-selection error rate

The lexical-selection error rate is the fraction of times the given system chooses a translation for a word which is not found in an annotated reference. The process uses a source-language sentence S = (s1, s2, …, s|S|) and three functions. The first function, Ts(si), returns all possible translations of si according to the bilingual dictionary. The second function, Tt(si), returns the translations of si selected by the lexical-selection module: Tt(si) ⊆ Ts(si), and usually |Tt(si)| = 1. If the lexical-selection module returns more than one translation, the first translation is selected; this is equivalent to the behaviour of the structural-transfer module (see Appendix A). The third function, Tr(si), returns the set of reference translations which are acceptable for si in sentence S.11 For a single sentence, we define the lexical-selection error rate (LER) of that sentence as follows:

\[ \mathrm{LER} = \frac{\sum_{i=1}^{|S|} \mathrm{amb}(s_i)\,\mathrm{diff}(T_r(s_i), T_t(s_i))}{\sum_{i=1}^{|S|} \mathrm{amb}(s_i)} \tag{2.1} \]

where amb (Equation 2.2) tests whether a word is ambiguous, and the function diff (Equation 2.3) signals a difference if the intersection between the set of reference translations Tr(si) and the set of translations from the lexical-selection module Tt(si) is empty. Recall that, although Tt(si) returns a set, this set will have one member, as when the lexical-selection module returns more than one translation the first is selected.

\[ \mathrm{amb}(s_i) = \begin{cases} 1 & \text{if } |T_s(s_i)| > 1 \\ 0 & \text{otherwise} \end{cases} \tag{2.2} \]

\[ \mathrm{diff}(T_r(s_i), T_t(s_i)) = \begin{cases} 1 & \text{if } T_r(s_i) \cap T_t(s_i) = \emptyset \\ 0 & \text{otherwise} \end{cases} \tag{2.3} \]

S         estiu      ser    un    estació
Ts(si)    {summer}   {be}   {a}   {station, season}
Tr(si)    {summer}   {be}   {a}   {season}
Tt(si)    {summer}   {be}   {a}   {station}

Figure 2.1: The input sentence and the three sets of translations used for calculating the lexical-selection error rate. The source sentence S = (s1, s2, …, s|S|) has one ambiguous word, estació. There is one difference between the reference set Tr(si) and the test set Tt(si) of translations; thus the error rate for this sentence is 100%.

The table in Figure 2.1 gives an overview of the inputs. In the description it is assumed that the reference translation has been annotated by hand. However, hand annotation is a time-consuming process, and was not possible; a description of how the reference was made is given in section 2.3.
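As a worked illustration of Equations 2.1–2.3, the following Python sketch (not the thesis implementation; for simplicity it assumes words do not repeat within a sentence) reproduces the example of Figure 2.1, where the single ambiguous word is selected wrongly and the sentence-level LER is therefore 100%.

def ler(sentence, Ts, Tr, Tt):
    """Lexical-selection error rate (Equation 2.1) for one sentence.
    Ts, Tr and Tt map each source word to its possible, reference and
    system-selected translation sets respectively."""
    ambiguous = [s for s in sentence if len(Ts[s]) > 1]        # Eq. 2.2
    errors = sum(1 for s in ambiguous if not (Tr[s] & Tt[s]))  # Eq. 2.3
    return errors / len(ambiguous) if ambiguous else 0.0

Ts = {"estiu": {"summer"}, "ser": {"be"}, "un": {"a"},
      "estació": {"station", "season"}}
Tr = {**Ts, "estació": {"season"}}    # the reference selects 'season'
Tt = {**Ts, "estació": {"station"}}   # the module selects 'station'
print(ler(["estiu", "ser", "un", "estació"], Ts, Tr, Tt))     # 1.0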

2.2.2 Bilingual evaluation understudy

The BLEU (Bilingual Evaluation Understudy; Papineni et al., 2002) metric is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human; the central idea is that the closer a machine translation is to a human translation, the better it is (Papineni et al., 2002). BLEU was one of the first metrics reported to achieve a high correlation with human judgements of quality (Papineni et al., 2002; Coughlin, 2003) and remains one of the most popular automated metrics.

11 Depending on how the reference is made, the set returned by Tr(si) may be incomplete in the sense that it may not include all possible acceptable translations.

Scores are calculated over a whole document — a set of sentences — by comparing them with a set of reference translations. Intelligibility and grammatical adequacy are not explicitly taken into account. BLEU is designed to approximate human judgement at the corpus level, and performs badly if used to evaluate the quality of individual sentences. The metric also does not correlate with human judgements when ranking systems based on different technologies, and is recommended only for tracking improvements in performance over different configurations of the same system (Callison-Burch et al., 2006). Additionally, Denkowski and Lavie (2012) report that it does not detect post-editing operations that improve translation quality.
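Since BLEU is used here only as a standard corpus-level reference metric, any off-the-shelf implementation can be used; the sketch below assumes NLTK's implementation, which the thesis itself does not prescribe. A single sentence pair is shown only to illustrate the API — as noted above, sentence-level scores are not meaningful.

from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is paired with a list of reference translations.
references = [[["the", "new", "industrial", "unit", "is", "found",
                "in", "the", "biggest", "town", "in", "the", "area"]]]
hypotheses = [["the", "new", "building", "is", "found", "in", "the",
               "largest", "village", "of", "the", "area"]]
print(corpus_bleu(references, hypotheses))  # corpus-level 4-gram BLEU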

2.3 Corpora

Four parallel corpora are used for the experiments in this thesis:

• Ofis ar Brezhoneg (OAB): This parallel corpus of Breton and French has been collected specifically for lexical-selection experiments from translations produced by Ofis ar Brezhoneg 'The Office of the Breton Language'.12 It contains some parallel data previously described in Tyers (2009b), and also some new data. The whole corpus has been made available online through the OPUS collection.13

• South-East European Times (SETimes): Described in Tyers and Alperen (2010), this corpus is a multilingual corpus of the Balkan languages in the news domain. The Macedonian and English portion will be used.

• Open Data Euskadi (OpenData): This is a Basque and Spanish parallel corpus made from the translation memories of the Herri Arduralaritzaren Euskal Erakundea 'Basque Institute of Public Administration'.14

12 In 2010 the administrative status was changed and it was renamed Ofis Publik ar Brezhoneg 'Public Office of the Breton Language'. The corpus was published before this date, so we use the original name.

13 http://opus.lingfil.uu.se/OfisPublik.php
14 http://opendata.euskadi.net/w79-contdata/es/contenidos/ds_recursos_linguisticos/memorias_traduccion/es_izo/memorias_traduccion_izo.html

• European Parliament Proceedings (EuroParl): Described in Koehn (2005), this is a multilingual corpus of the official languages of the European Union. We are using the English–Spanish data from version 7 of the corpus.

There are a number of approaches in the literature to creating evaluation corpora for lexical selection. Vickrey et al. (2005) use a parallel corpus to make annotated test and training sets for experiments in lexical selection applied to a simplified translation problem in SMT. They use word alignments from GIZA++ (Och and Ney, 2003) to annotate source-language words with their translations from the reference translation in the parallel corpus. One disadvantage of this method is that only one translation is annotated per source-language word, meaning that accuracies may be lower because of missing translations — that is, the system chooses a translation which is adequate, but which is not found in the reference translation. A second disadvantage is that the word alignments may not be 100% reliable, which decreases the accuracy of the annotated corpus. An alternative method is described by Zinovjeva (2000), who manually tags a selection of ambiguous words in English sentences with their translations in Swedish; the size of the annotated corpus is not mentioned. Specia et al. (2005b) use a parallel corpus, but eschew word alignment in favour of a bilingual dictionary and heuristic rules based on part-of-speech and relative position. The corpus was reviewed manually before being used in further experiments (Specia et al., 2005a).

The ideal situation would be to have, as described by Zinovjeva (2000), a hand-annotated evaluation corpus for testing the performance of the lexical-selection module — that is, the output of the lexical-transfer module disambiguated by hand by one or more human annotators. As this did not exist for any language pair, we decided to automatically annotate a test set for each language pair using a process similar to that described by Vickrey et al. (2005). The decision to annotate automatically instead of by hand was made for a number of reasons. Firstly, the time involved in hand-annotating approximately 30 pages of text per language pair would be prohibitive. Secondly, linguists were not available for all of the language pairs. The disadvantage of this method is that the set of translations for each ambiguous word may not be complete, as it will only include the translation which is found in the corresponding sentence in the parallel corpus.

The annotation process proceeds as follows. First we word-align the corpus to extract a set of word alignments, which are correspondences between words in sentences on the source side of the parallel corpus and those on the target side. Any aligner may be used, but in this thesis we use GIZA++ (Och and Ney, 2003).15 Where there is more than one possible alignment, the most probable (Viterbi) alignment is chosen. We then use these alignments along with the bilingual dictionary of the MT system in question to extract only those sentences where: there is at least one ambiguous word; that ambiguous word is aligned to a single word in the target language; and the word it is aligned to in the target language is found in the bilingual dictionary of the MT system. Sentences where there are no ambiguous words (approximately 90%, see Table 2.1) are discarded. The source side of each extracted sentence is then passed through the lexical-transfer module, which returns all the possible translations, and for each ambiguous word the translation which is found aligned in the reference is selected.

15 The exact configuration of GIZA++ used is equivalent to running the Moses toolkit (Koehn et al., 2007) in its default configuration up to step three of training.

Pair    Lines       Extract.   train     other   No. amb   Av. amb
br-fr   57,305      4,668      2,668     2,000   603       1.07
mk-en   190,493     19,747     17,747    2,000   13,134    1.86
eu-es   765,115     87,907     85,907    2,000   1,806     1.30
en-es   1,467,708   312,162    310,162   2,000   2,082     1.08

Table 2.1: Statistics about the source corpora. The column other gives the number of sentences reserved for testing (1,000) and development (1,000). The column No. amb gives the number of unique tokens with more than one possible translation. The column Av. amb gives the average number of translations per word; this is calculated by looking up each word in the corpus in the bilingual dictionary of the MT system and dividing the total number of translations by the number of words. Both Av. amb and No. amb are calculated over the whole corpus.

Pair    Lines   SL words   TL words   Amb. words   % ambig
br-fr   1,000   13,854     13,878     1,163        8.39
mk-en   1,000   13,441     14,228     3,872        28.80
eu-es   1,000   7,967      11,476     1,360        17.07
en-es   1,000   19,882     20,944     1,469        7.38

Table 2.2: Statistics about the test corpora. The columns Amb. words and % ambig give the number of words with more than one translation and the percentage of SL words which have more than one translation, respectively.
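The per-sentence extraction and annotation step described above can be sketched as follows. This is a simplified illustration of the criteria, assuming the Viterbi alignment is given as a set of (i, j) position pairs and the bilingual dictionary as a map from an SL word to its set of possible translations; the real pipeline operates on lexical forms rather than plain words.

def annotate_sentence(sl_words, tl_words, alignment, bidix):
    """Return {SL position: reference translation} for the ambiguous
    words of one sentence, or None if the sentence is discarded."""
    reference = {}
    for i, word in enumerate(sl_words):
        translations = bidix.get(word, set())
        if len(translations) <= 1:
            continue                            # not ambiguous
        links = [j for (k, j) in alignment if k == i]
        if len(links) != 1:
            continue                            # not aligned to a single TL word
        if tl_words[links[0]] in translations:  # TL word found in the dictionary
            reference[i] = tl_words[links[0]]
    return reference or None                    # discard: nothing annotated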

After this process, we selected 1,000 sentence pairs at random for testing (test), 1,000 for tuning and development (dev), and left the remainder for training. This is a smaller number of sentences than is used in typical evaluations (such as the WMT series; Callison-Burch et al. (2012)), but was motivated by the fact that the smallest corpus (for Breton–French) was only a little over 4,000 sentences long after sentence discarding. Table 2.1 gives statistics about the size of the input corpora, and how many sentences were left after processing for testing, training and development. Table 2.2 gives information about the test corpora.

Pair    Metric     Ling            Oracle          Diff.
br-fr   LER (%)    [54.8, 60.7]    [0.0, 0.0]      –
        BLEU (%)   [14.5, 16.4]    [16.7, 18.6]    [2.2, 2.2]
mk-en   LER (%)    [28.8, 32.6]    [0.0, 0.0]      –
        BLEU (%)   [28.6, 31.0]    [30.9, 33.3]    [2.3, 2.3]
eu-es   LER (%)    [43.6, 48.8]    [0.0, 0.0]      –
        BLEU (%)   [10.1, 12.0]    [11.5, 13.5]    [1.4, 1.5]
en-es   LER (%)    [20.5, 24.9]    [0.0, 0.0]      –
        BLEU (%)   [21.5, 23.4]    [22.8, 24.7]    [1.3, 1.3]

Table 2.3: LER and BLEU scores with 95% confidence intervals for the reference systems on the test corpora. Ling is the linguist-chosen defaults. For the BLEU score, we also present the difference between the linguistic defaults and the oracle. Note that the confidence intervals overlap in all pairs aside from Breton–French.

2.4 Reference results

We compare our methods to the following reference (or baseline) systems:

• Linguist-chosen defaults. A bilingual dictionary in an Apertium language pair contains correspondences between lexical forms. The dictionaries allow many lexical forms to translate to one lexical form. For example, ahizpa and arreba in Basque both translate to hermana 'sister' in Spanish.16 But a single lexical form may not have more than one translation without further processing. If there are many possible translations of a lexical form, then one must be marked as the default translation. For example, pisu in Basque translates to peso 'weight' and piso 'flat' in Spanish. If the translation correspondence pisu → peso is perceived as the most frequent or the most general, it is marked as the default by the linguist who is creating the bilingual dictionary.

• Oracle. The results for the oracle system are those achieved by passing the automatically annotated reference translation (as in section 2.3) through the rest of the modules of the MT system. This is included to give an idea of the upper bound for the performance of the lexical-selection module: if all of the lexical choices were made in accordance with the reference, this is the result that would be achieved.

Table 2.3 summarises the state of the art (the linguist-chosen defaults) for each of the language pairs in Apertium with respect to our two evaluation metrics. It also shows the results for the oracle, which presents the upper bound on performance. Individual chapters may define additional reference systems. The high error rate for the Breton–French pair may be a result of the linguistic defaults having been tuned to a different domain than that of the corpus.

16 The difference being that ahizpa is a sister of a female, and arreba is a sister of a male.

Chapter 3

Constraint-based lexical selection

This chapter describes and motivates a system for lexical selection in rule-based machine translation based on fixed-length constraints. We have chosen to rely on fixed-length contexts instead of whole-sentence context for efficiency reasons; this is similar to other modules in the Apertium system. It is important that the lexical-selection module does not run slower than the slowest module of the system, so that it does not become a bottleneck. Furthermore, fixed-length source-language context has been shown to provide an improvement in lexical choice over no context in corpus-based machine translation (Zens et al., 2002; Koehn et al., 2003).

The layout of this chapter is as follows: we first describe a formalism and format for writing lexical-selection rules; then we go on to describe how these rules can be compiled into a finite-state transducer and applied to an input stream. We then describe an experiment where rules are written by hand to try to improve the lexical-selection performance of the reference systems (see the previous chapter), and finish with a discussion.

3.1 Formalism for lexical selection rules

Lexical-selection rules in our formalism operate on an input sentence S = (s1, s2, …, sn). The sentence consists of SL lexical units.1 For each lexical unit in the SL, a function Ts(si) returns the set of possible target-language translations.2 A pair where |Ts(si)| > 1 is termed ambiguous.

1 A lexical unit is the lemma of a word along with a sequence of tags representing the part-of-speech and any morphological information.
2 Translations are lexical units made of the target-language lemma followed by any tags returned by the lexical-transfer component (see section A.2.5). A translation, before being passed through the structural-transfer stage, may not be a complete lexical unit for the purposes of target-language generation.

c               U
estació n.f.*   (select, season n.*)
*               (skip, *)
plujós adj.*    (skip, *)

Table 3.1: Demonstration of a rule r in our formalism. For the ambiguous pair (estació, {station, season, resort}), the rule selects the translation 'season' if the pair is followed by the sequence (any lexical form, plujós adjective). The column c refers to a context, a sequence of patterns, and the column U refers to a sequence of operations.

A rule r = (c, U) is made up of a pair of sequences: a sequence of patterns c = (x1, x2, …, x|c|), one per lexical form, and a sequence of operations U = (u1, u2, …, u|c|). A pattern may contain a lemma, a sequence of tags, a wildcard which matches any lemma or any tag, or a combination of the above. Patterns in the rule match source-language lexical units, and the operations are carried out on the set of target-language translations. Each operation u = (y, d) is made up of an instruction y and a target-language pattern d of lexical forms. The possible instructions are: remove, which removes from the target set any translation matching the pattern d; select, which removes from the target set all translations which do not match the pattern d; and skip, which makes no modification.
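The semantics of the three instructions can be made concrete with a short Python sketch; patterns are reduced here to plain target-language lemmas, whereas the real module matches lemmas, tags and wildcards.

def apply_operation(operation, translations):
    """Apply one operation u = (y, d) to the set of candidate
    translations of an ambiguous source word."""
    instruction, pattern = operation
    if instruction == "select":    # keep only translations matching d
        return {t for t in translations if t == pattern}
    if instruction == "remove":    # drop translations matching d
        return {t for t in translations if t != pattern}
    return set(translations)       # "skip" makes no modification

# The operation of Table 3.1 applied to the translations of estació:
print(apply_operation(("select", "season"),
                      {"station", "season", "resort"}))  # {'season'}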

The rules are expressed in an XML-based format because of its human and machine readability.3 Figure 3.1 shows the rule described by Table 3.1 in our XML format. The match tag defines a source-language pattern; it may contain the attributes lemma and tags, or neither. If there is no lemma and no tags, the pattern matches any source-language lexical unit (<match/>). Each match tag may contain either one or zero operation tags. An operation tag is either select or remove; the attribute lemma is mandatory, and the attribute tags is optional. These attributes define the target-language pattern to be matched. If there is no operation tag, the default operation, skip, is assumed.

Figure 3.2 shows an example of a set of hand-written rules, in the XML format described above, for the Catalan word estació, which can be translated into English as 'station', 'season' or 'resort'. This figure also shows the use of the or tag, which allows any of the patterns contained in it to be matched.

<rule>
  <match lemma="estació" tags="n.f.*">
    <select lemma="season" tags="n.*"/>
  </match>
  <match/>
  <match lemma="plujós" tags="adj.*"/>
</rule>

Figure 3.1: The rule in Table 3.1 expressed in XML format. Note that the empty <match/> tag matches any lexical unit in the source language and performs the skip operation.

3.2 Compilation and finite-state representation

The set of rules R expressed in XML is not processed directly; they are compiled into a finite-state letter transducer (Roche and Schabes, 1997); see Figure 3.3. In this transducer, the input symbols form patterns of source-language forms, while the output symbols represent operations on a target-language pattern. The transducer is defined as ⟨Q, L, δ, q0, qF⟩, where Q is the set of states; L = Σ × Γ is the alphabet of transition labels, where Σ is the set of input symbols and Γ is the set of output symbols; δ : Q × L → Q is the transition function; q0 is the initial state (nothing matched); and qF is the final state, indicating that a complete pattern has been matched. Rules in R are paths (sequences of transitions) from q0 to qF. Parts of these paths may be shared between rules.
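How shared rule prefixes become shared paths can be sketched as follows. As a simplification, whole (pattern, operation) pairs are used as transition labels instead of letters, so this illustrative sketch keeps the trie structure of the letter transducer but not its alphabet.

def compile_rules(rules):
    """Compile a list of rules, each a sequence of (SL pattern,
    operation) labels, into a trie-shaped automaton with shared
    prefixes. States are integers, with q0 = 0."""
    delta = {}                     # (state, label) -> next state
    final = {}                     # accepting state -> rule identifier
    next_state = 1
    for rule_id, rule in enumerate(rules, start=1):
        q = 0
        for label in rule:
            if (q, label) not in delta:
                delta[(q, label)] = next_state
                next_state += 1
            q = delta[(q, label)]  # follow or extend the shared path
        final[q] = rule_id         # a complete pattern ends here
    return delta, final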

3.3 Rule application process

In order to apply the rules to an input sentence, we use a variant of the best-coverage algorithm described by Sánchez-Martínez et al. (2009). We try to cover the maximum number of words of each SL sentence by using the longest possible rules; the motivation for this is that the longer the rules, the more accurate their decisions may be expected to be, because they integrate more context.

To compute the best coverage, a dynamic-programming algorithm (Alg. 1) is applied, which restarts the automaton at every new word in the sentence to be translated, and uses a set of alive states A in the automaton and a map M that, for each word in the sentence, returns the best coverage up to that word together with its score.

3 The rules may, however, be expressed in any format. Other systems for rule-based lexical selection use other formats, such as relational databases (OpenLogos; Scott and Barreiro (2009)) or plain-text files (Dan2Eng; Bick (2007)).

<rule>
  <or>
    <match lemma="curt" tags="adj.*"/>
    <match lemma="llarg" tags="adj.*"/>
  </or>
  <match lemma="estació"><select lemma="season"/></match>
</rule>
<rule>
  <match lemma="estació"><select lemma="season"/></match>
  <match lemma="de"/>
  <match lemma="el" tags="det.def.*"/>
  <or>
    <match lemma="any" tags="n.m.*"/>
    <match lemma="recol·lecció" tags="n.f.*"/>
  </or>
</rule>
<rule>
  <match lemma="estació"><select lemma="season"/></match>
  <match lemma="de"/>
  <or>
    <match lemma="estiu"/>
    <match lemma="tardor"/>
    <match lemma="hivern"/>
    <match lemma="primavera"/>
  </or>
</rule>
<rule>
  <match lemma="estació"><select lemma="season"/></match>
  <or>
    <match lemma="sec"/>
    <match lemma="plujós"/>
    <match lemma="humit"/>
  </or>
</rule>
<rule>
  <match lemma="estació"><select lemma="season"/></match>
  <match tags="pr"/>
  <or>
    <match lemma="pluja"/>
    <match lemma="núvol"/>
    <match lemma="temperatura" tags="n.f.pl"/>
  </or>
</rule>
<rule>
  <match lemma="estació"><select lemma="resort"/></match>
  <or>
    <match lemma="termal"/>
    <match lemma="balneari"/>
  </or>
</rule>
<rule>
  <match lemma="al costat de"/>
  <match/>
  <match lemma="estació"><remove lemma="season"/></match>
</rule>

Figure 3.2: An example of lexical-selection rules for the Catalan–English pair in the XML format. The rules were written by hand to choose alternative translations of the ambiguous word estació 'station (default), season, resort'. The order of the rules is not important for their application.


[Figure: a transducer with word-labelled transitions, e.g. estació: select(season) → de: skip() → el: skip() → {any: skip(), recol·lecció: skip(), hivern: skip(), …}, and al costat de: skip() → *: skip() → estació: remove(season); final transitions carry rule identifiers such as <1>, <2> and <6>.]

Figure 3.3: A finite-state transducer representing some of the lexical-selection rules described in Figure 3.2. The representation has been simplified by replacing each series of letter transitions with a single word transition. The numerals before the final state are the rule identifiers, used for tracing rule application.

Algorithm 1 OC: Compute the best coverage of an input sentence.
Require: s: SL sentence to translate
 1: A ← {q0}
 2: i ← 1
 3: while i ≤ length(s) do
 4:   A′ ← ∅
 5:   M[i] ← wordForWord()
 6:   for all q ∈ A do
 7:     for all c ∈ Q : (∃t : δ(q, ⟨s[i] : t⟩) = c) do
 8:       A′ ← A′ ∪ {c}
 9:       if c = qF then
10:         M[i] ← best(M[i], append(M[i − ruleLength(c)], c))
11:       end if
12:     end for
13:     A ← A − {q}
14:   end for
15:   A ← A′ ∪ {q0} /* To start a new search from the next word */
16:   i ← i + 1
17: end while
18: return M[i − 1]


Algorithm 1 uses five external procedures: length(s) returns the number of SL patterns4 in the string s; ruleLength(c) returns the number of words of the rule matched by state c; append(cov, c) computes a new coverage by adding to coverage cov the rule recognised by state c; wordForWord() returns a word-by-word coverage; finally, best(a, b) receives two coverages and returns the one using the least possible number of rules. If there is no rule matching a particular lexical unit, this is counted as a single rule. If two different coverages use the same number of rules, the former is overwritten by the latter.
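By way of illustration, the following minimal Python sketch implements the same dynamic programme under the simplifying assumption that rules are plain word patterns (with "*" as a wildcard) rather than paths in the compiled transducer; all function and variable names are ours.

def matches(pattern, words):
    """A pattern element matches a word if it equals it or is '*'."""
    return len(pattern) == len(words) and all(
        p == "*" or p == w for p, w in zip(pattern, words))

def best_coverage(words, rules):
    """Coverage of `words` using the fewest rules; an unmatched word
    counts as one rule, and on ties the later candidate wins."""
    M = [[]]                                     # M[i]: best coverage of words[:i]
    for i in range(1, len(words) + 1):
        M.append(M[i - 1] + [(words[i - 1],)])   # word-by-word default
        for pattern in rules:
            k = len(pattern)
            if k <= i and matches(pattern, words[i - k:i]):
                candidate = M[i - k] + [tuple(pattern)]
                if len(candidate) <= len(M[i]):
                    M[i] = candidate
    return M[len(words)]

rules = [("curt", "estació"), ("estació", "de", "hivern")]
print(best_coverage("curt estació de hivern".split(), rules))
# [('curt',), ('estació', 'de', 'hivern')] -- two units instead of four

Minimising the number of rules in a coverage indirectly prefers longer rules, since a longer rule covers more words with a single decision.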

3.4 Experiments

Our experiments have the following objectives: the first is to test whether the rule formalism described in section 3.1 is adequate for people to write lexical-selection rules which improve translation quality in our evaluation systems; the second is to determine how much improvement can be achieved with relatively little manual work; and the third is to see if the addition of a lexical-selection module has a significant impact on translation speed.

3.4.1 Task

For each of the evaluation systems (see section 2.1), we asked a volunteer with knowledge of both languages to spend eight hours writing lexical-selection rules. The volunteers had the following linguistic and professional profiles:

• en-es: Computer-science student with native-speaker knowledge of Spanish. Fluent in English.

• eu-es: Linguist with native-speaker knowledge of German. Fluent in Spanish and Basque.

• mk-en: Computer-science student with native-speaker knowledge of Macedonian. Fluent in English.

• br-fr: Language-board director with native-speaker knowledge of French. Fluent in Breton.

None of the volunteers had previous experience with the rule formalism. Each volunteer was given: a list of the ambiguous words with their translations and the bilingual-dictionary defaults marked; the source-language side of the training corpus; the output of the MT system for the source-language side of the corpus using the bilingual-dictionary defaults; and a short document describing the format.5

4This is equivalent to the length of the rule in terms of lexical units.


Pair     Rule length                  Total   Words   Coverage
         1    2    3   4   5   >5                     test    train
br-fr    -    19   22  10  2   -      53      10      0.9%    0.5%
mk-en    17   119  15  6   -   -      156     28      3.0%    3.2%
eu-es    -    16   10  6   5   8      45      12      0.3%    0.3%
en-es    -    105  -   1   -   -      106     73      1.7%    2.1%

Table 3.2: Overview of the rule sets written by the volunteers. The coverage was calculated for both the test and training portions of the corpora; it gives the percentage of ambiguous lexical units that had rules applied to them. The Words column gives the number of words for which at least one disambiguation rule was written.

They were instructed to write as many lexical-selection rules as possible, using any dictionary or linguistic resources at their disposal. Table 3.2 gives an overview of the rule sets written by the volunteers.

The rule sets for the Breton–French, Macedonian–English and English–Spanish pairs overwhelmingly contained lexicalised rules that selected a particular translation when followed by, or preceded by, a given lemma (a rule length of 2). In seventeen cases, the rules for Macedonian–English contained no context, which had the effect of changing the linguistic-default translation. The bilingual dictionary in the Macedonian–English language pair contained many synonyms, as a result of having been postedited from the output of a word aligner. In this pair, the volunteer tried to find contexts to choose between words with very close meanings, for example big and large. These rules were typically unsuccessful, usually due to inadequate knowledge of the distinctions in the target language.

For the Basque–Spanish pair, the rules were more diverse; several relied on grammatical information, for example a rule which chose entrar ‘enter’ as a translation of sartu ‘put, enter, insert, …’ if the auxiliary verb had only absolutive agreement. There were also rules which attempted to imitate scanning, that is, looking for a word in any position to the left or right. For example, one rule (Figure 3.4) chose suponer ‘suppose’ as a translation of eman ‘give, suppose, …’ if there was a clitic conjunction -la ‘that’ six words to the right, with any words in between. This same rule was duplicated for finding the conjunction one to five words to the right.

The English–Spanish rule set contained a number of rules selecting the default translation, and also contained rules matching ungrammatical constructs not found in the source corpus, for example a verb with a countable object in the singular without a determiner: raise family as opposed to raise a family.

5 http://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules; Accessed: 10th December, 2012


<rule comment="example:␣demagun␣A␣espazioko␣puntu␣bat␣dela.">
  <match lemma="eman" tags="vblex.*"><select lemma="suponer"/></match>
  <match/><match/><match/><match/><match/><match/>
  <match lemma="la" tags="cnjsub"/>
</rule>

Figure 3.4: A hand-written rule selecting suponer as a translation of eman ‘give, suppose, …’. The rule has six wildcards to allow an intervening subordinate clause of six words. In the Basque–Spanish rules there were seven rules of this type for the same selection, each with a different number of <match/> tags. The example sentence may be translated as ‘Let us suppose that A is a point in space.’, where ‘that’ is translated by -la and the wildcards match the intervening clause.

When asked which resources they used, the Breton–French and Macedonian–English volunteers said that they made heavy use of the corpus of example sentences provided, whereas the Basque–Spanish and English–Spanish volunteers preferred to work with dictionaries. The Basque–Spanish volunteer did not use the corpus because the sentences were above their linguistic level, and in the case of the English–Spanish volunteer, the corpus was too large to know where to start.

3.4.2 Results

Table 3.3 presents the results of the hand-written rule sets as applied to the test corpus, with the reference results for comparison. As can be seen, for the Breton–French and Macedonian–English pairs, the increase in performance for both LER and BLEU was small, although significant at p = 0.95. For the Basque–Spanish pair, the result for LER was significant, but the result for BLEU was not (at p = 0.95). The rule set for the English–Spanish pair had no perceivable effect on performance.

In addition to the two automatic measures described in section 2.2, we calculated two other quality measures: (a) naïve accuracy, a measure of how many times a rule selected the same translation as that found in the reference translation; and (b) differential accuracy, a measure of how many times an applied rule selected the same translation as, or a more adequate translation than, the one in the reference.

The idea of the differential-accuracy evaluation is to get an idea of how frequently the rules written by the volunteers improved the translation, while taking into account that the translation in the reference might not be the only possible adequate translation. This was evaluated manually, by looking at the sentences in the test corpus and the sentences output by the machine translation system.


Pair     Metric      System
                     Ling            Oracle          Hand
br-fr    LER (%)     [54.8, 60.7]    [0.0, 0.0]      [54.3, 60.2]
         BLEU (%)    [14.5, 16.4]    [16.7, 18.6]    [14.6, 16.5]
mk-en    LER (%)     [28.8, 32.6]    [0.0, 0.0]      [26.3, 30.0]
         BLEU (%)    [28.6, 31.0]    [30.9, 33.3]    [28.7, 31.1]
eu-es    LER (%)     [43.6, 48.8]    [0.0, 0.0]      [43.4, 48.5]
         BLEU (%)    [10.1, 12.0]    [11.5, 13.5]    [10.1, 12.0]
en-es    LER (%)     [20.5, 24.9]    [0.0, 0.0]      [20.5, 24.9]
         BLEU (%)    [21.5, 23.4]    [22.8, 24.7]    [21.5, 23.4]

Table 3.3: Results for each language pair after applying the hand-written rules to the evaluation corpora. The columns give LER and BLEU scores for the three systems: Ling (the linguistic defaults), Hand (the hand-written rules) and Oracle (the result which would be achieved if for every ambiguous word we selected the word in the reference). Both LER and BLEU scores are given as percentages.

As a general method of evaluation, this would not be feasible, as it requires manually checking every rule choice. Consider the following examples for the Macedonian word голем ‘big (default), large, great’.6 In the examples below, S is the sentence in Macedonian, R is the reference translation, D is the default translation, and L is the translation produced by the lexical-selection system.

(1) S: Чувствував ужасен страв и голема љубов за градот.
    R: I felt terrible fear of, and great love for the city.
    D: Was feeling terrible fear and big love for the city.
    L: Was feeling terrible fear and great love for the city.

(2) S: Сè на сè, проектот има голем успех.
    R: Overall, the project is a big success.
    D: All of all, the project have big success.
    L: All of all, the project have great success.

(3) S: Планот предвидува инвестирања во големите градови.
    R: The plan envisions investments in big cities.
    D: The plan is predicting investments in the big cities.
    L: The plan is predicting investments in the large cities.

6 The translations found in the bilingual lexicon are: ‘big, large, great, high, major, main, huge, massive, severe, vast, significant, considerable, broad’, but for reasons of space we include only the relevant ones.


Pair     Applied   Accuracy   Differential
br-fr    10        70.0%      70.0%
mk-en    65        44.6%      67.69%
eu-es    4         100.0%     100.0%
en-es    26        92.3%      92.3%

Table 3.4: Results for each language pair after applying the hand-written rules to the evaluation corpora. The column Applied shows how many rules were applied in total, the column Accuracy shows the naïve accuracy, and the column Differential shows the differential accuracy.

Example (1) shows a result where a rule matches a pattern and selects a translation that is in the reference translation. Example (2) shows a result where a rule matches a pattern and selects a translation which is as adequate as the one in the reference. Finally, example (3) shows a result where a rule matches a pattern and selects a translation which is not in the reference and is less adequate. The naïve-accuracy measure would count (1) as a success, but not (2), while the differential-accuracy measure would count both (1) and (2) as successes.

Table 3.4 gives the results for the two additional measures. As can be seen from the Macedonian–English results, there can be a noticeable difference between the results for naïve accuracy and differential accuracy; lexical-selection rules which choose synonyms of translations not found in the reference are the cause of the difference. For the Basque–Spanish pair, the figure of 100% accuracy indicates that all of the rules which were applied chose the translation which was in the reference. Although the accuracy is high for the English–Spanish pair, there is no perceivable difference in performance for the two evaluation measures (see Table 3.3). This is because the majority of selections performed by the rules (21 out of 26) chose the same translation as the linguistic default. None of the rule sets for the other language pairs contained rules selecting the linguistic-default translation.

3.5 Discussion

This chapter has presented a formalism for writing lexical-selection rules, an XML-based format for expressing them, and an implementation based on finite-state transducers. Up to around 1,000 rules, the module has no perceivable effect on performance; that is, running the MT system with the lexical-selection module is as fast as running it without. After this, each time the number of rules is doubled, the number of words per second processed by the system drops by 200 on average. However, even with 30,000 rules the system is still capable of processing on the order of thousands of words per second.


SELECT ("season" n) IF (0 ("<estació>" n f))(2 ("<plujós>" adj)) ;

Figure 3.5: The rule from Figure 3.1 rewritten in the Constraint Grammar (CG) formalism. The angle brackets are used to identify the lemmas in the source language; the numerals indicate relative positions.

The chapter has shown that time spent on manually writing rules in this formalism can make a small, but statistically significant (at p = 0.95), dent in the problem of lexical selection, but that there is a problem with coverage: even after a full day's work, in the best case, the Macedonian–English pair, the coverage was just under 3% on the test corpus. Not to be underestimated, however, is the utility of being able to quickly and easily fix lexical-selection errors. It is also worth noting that the rules are not restricted to Apertium: rules based on source-language context of this type could easily be recast in another formalism, for example Constraint Grammar (see the example in Figure 3.5).

In the following chapters, we study methods to learn these same rules automatically, without the intervention of rule writers. The objectives are threefold: to improve lexical-selection accuracy in rule-based machine translation in the absence of rule writers; to increase the coverage of the rules; and to decrease the work for rule writers working on rule-based machine translation. It is worth noting that the rule sets we learn will be in the same formalism, and so can be edited and refined by humans.


Chapter 4

Learning lexical-selection rules

4.1 Overview

We have seen in the previous chapter how human expertise and intuition can be used to manually write lexical-selection rules incorporating context. Using these context rules can provide a reduction in lexical-selection error rate over simply picking the most frequent or most general translation. A drawback of this method is that it relies on humans with linguistic knowledge,1 who may not always be on hand. In the absence of such humans, we must look elsewhere to improve the lexical-selection performance of our machine-translation system.

One source of knowledge that can be used in lexical selection is a parallel corpus. This can be thought of as a collection of expert judgements2 of how to translate a word in language A into language B in a given context. This knowledge source is used in both example-based (Nagao, 1984; Carl and Way, 2003) and statistical (Koehn, 2010) machine translation. Some specifics of statistical machine translation have been dealt with in the introduction (section 1); here we briefly review how local context is used. In word-based SMT (Brown et al., 1993), local context is used only on the target-language side. The translation model provides, for each source-language word, the possible target-language translations, and the job of the target-language model is, among other things, to disambiguate between them. As noted by Zens et al. (2002), the language model is not always capable of doing this.

1 Here we use linguistic knowledge to mean either knowledge of linguistics or knowledge of a specific language.

2 In most cases, parallel corpora are translations performed by translation professionals; however, this may not always be the case: the OPUS corpus collection (Tiedemann and Nygård, 2004) contains corpora of subtitles and software-localisation strings, which may have been translated by non-experts.

3 The phrases in phrase-based statistical machine translation are not syntactic constituents and are better termed segments, sequences or chunks, but here we follow the normal SMT nomenclature.


In phrase-based3 statistical machine translation (Zens et al., 2002; Koehn et al., 2003), context from the source language is incorporated by including sequences of source-language words mapping to sequences of target-language words in the translation model. Zens et al. (2002) report that by including this extra context they are able to improve translation quality by 1.3% PER.4

The downside of any method relying on a parallel corpus is that parallel corpora are a scarce resource. Although parallel corpora are available for many major languages, they can by no means be said to be available for all of the world's written languages. On the other hand, monolingual corpora may easily be constructed for any language which has a presence on the World Wide Web (Scannell, 2007).

In this chapter we present a generalised method of corpus-based lexical-selection rule extraction. The method can take advantage of either supervised learning (if a parallel corpus is available for the language pair) or unsupervised learning (if only monolingual corpora are available). In the supervised method (section 4.3), the parallel corpus is word-aligned (section 4.3.1), and frequency counts of alignments between source-language words and target-language words in context are used to extract rules.

Sánchez-Martínez et al. (2008) show that a statistical, supervised method for part-of-speech tagging, which relies on collecting frequency counts from labelled examples in a corpus, can be recast as an unsupervised learning task by using partial counts from a set of fractional corpora. These fractional corpora are assumed to have been generated by processing every possible disambiguation path for each sentence using the modules of a machine translation system and then scoring the results on a target-language model. The unsupervised learning method, which we present in detail in section 4.4, works in a similar way. The source-language sentences are translated to generate all possible translations as regards lexical choice. These are then scored on a target-language model. The scores from the language model are normalised over all choices to give fractional counts, such that each source-language sentence has associated with it a set of target-language translations. Each target-language translation has a fractional count corresponding to its share of the probability mass of all the possible translations of the source sentence. When training, we use these fractional counts in place of the aligned-word counts of the parallel method.

The remainder of the chapter is laid out as follows: first we present the common methodology, then in turn the methods of collecting counts from parallel corpora and from monolingual corpora. There is then an evaluation of the methods with respect to the rules we saw in the previous chapter, and other reference results from the literature.


4 The position-independent word error rate (PER) is the word error rate computed without taking position into account. We quote PER here, as opposed to WER or BLEU, as the phrase-based model also improves word order, which is not relevant to the task of lexical selection.


“Harik eta itsas xori handi bat, Albatroa deitzen zutena, …”5

n    n-grams
1    [ handi ]
2    [ xori handi ] [ handi bat ]
3    [ itsas xori handi ] [ xori handi bat ] [ handi bat , ]
4    [ eta itsas xori handi ] [ itsas xori handi bat ] [ xori handi bat , ] [ handi bat , Albatroa ]
5    [ Harik eta itsas xori handi ] [ eta itsas xori handi bat ] [ itsas xori handi bat , ] [ xori handi bat , Albatroa ] [ handi bat , Albatroa deitzen ]

Table 4.1: The 1- to 5-grams around the word handi ‘big, great’ in a sentence in Basque. The brackets denote the borders between separate n-grams for each value of n.

We end with an analysis and discussion of the results.

4.2 Common methodology

As we described in section 3.1 in the previous chapter, a lexical-selection rule is made up of: a source-language word, a target-language word, an operation (select or remove) and a fixed-length context. The source word and the set of possible target-language translations are available in the bilingual dictionaries of the machine translation system in question, so the task of any rule-learning process is to find the contexts and operations. In this chapter, we restrict ourselves to learning rules with a single operation. In order to learn the contexts, we rely on counting n-grams around an ambiguous word, together with its target-language translations. An n-gram is a contiguous sequence of n words. Table 4.1 gives some examples of n-grams extracted from a sentence.

The general method relies on generating source-language n-grams containing an ambiguous word, where the ambiguous word is annotated with its translation. We then count (see sections 4.3 and 4.4) how often each translation appears along with each n-gram, and generate a rule which selects the most frequent translation of the source-language word in that n-gram context. For this simple method to work, we need to be able to differentiate between those rules that improve lexical selection and those which do not, as not all generated rules are adequate (Tyers et al., 2012).
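As an illustration, the n-gram windows of Table 4.1 can be enumerated in a few lines of Python; the name ngrams and its signature are ours, chosen to match the pseudocode of Algorithms 3 and 4 below.

def ngrams(words, i, n):
    """All k-grams (k = 1..n) of `words` that contain position i."""
    out = []
    for k in range(1, n + 1):
        for start in range(i - k + 1, i + 1):
            if start >= 0 and start + k <= len(words):
                out.append(tuple(words[start:start + k]))
    return out

sentence = "Harik eta itsas xori handi bat , Albatroa deitzen zutena".split()
for gram in ngrams(sentence, sentence.index("handi"), 3):
    print(gram)   # the 1- to 3-grams around 'handi' from Table 4.1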

5 From Marinel zaharraren balada (Sarrionandia, 1995), a translation into Basque of Samuel Taylor Coleridge's Rime of the Ancient Mariner.


In Tyers et al. (2012) we obtained improved results only after applying ad hoc filters to prune the rules. The first was to remove hapax rules, those where the context was only seen once in the corpus; the second was to remove rules where the translation had a share of the frequency below a certain arbitrary threshold. Thus, when extracting rules from corpora, we need some method of distinguishing rules which will improve lexical-selection performance from those which will not. In this chapter, we formalise the rule-inclusion threshold as the ratio of the frequency of the alternative translation to that of the most frequent context-independent translation below which a rule will not be generated.

Algorithm 2 extract-rules: Algorithm to process the n-gram counts collected from a corpus into lexical-selection rules.
Require: C, V, count(c, si, tj), θ: set of contexts, set of ambiguous SL words, counts of SL words in context along with their TL translations, ratio threshold θ
 1: for all v ∈ V do
 2:   ζ∗ ← argmax_t Σ_c count(c, v, t) /* Calculate the default translation */
 3:   for all c ∈ C do
 4:     t∗ ← argmax_t count(c, v, t)
 5:     ξ ← count(c, v, t∗)/count(c, v, ζ∗)
 6:     if ξ > θ then
 7:       generate-rule(c, v, t∗)
 8:     end if
 9:   end for
10: end for

In Algorithm 2 we present the algorithm for extracting rules from the corpus. The input to the algorithm is a set of n-gram contexts C, a set of ambiguous SL words for which a translation has been seen in the corpus V, a threshold θ, and frequency counts count(c, si, tj) of ambiguous SL words in context along with their translations. For each SL word in context, the algorithm generates a rule if the ratio ξ of the count of the alternative translation to that of the default translation ζ∗ is higher than the threshold θ. The algorithm relies on two external functions: count, which returns frequency counts from the corpus, and generate-rule, which, given a context c, an SL word v and a translation t of the SL word, creates a rule that selects the translation t if word v is matched in context c. The function count depends on the learning method; the next two sections describe two possible implementations of this function. The sections also describe how we can find an appropriate value for the rule-inclusion threshold given the available data.
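The following Python sketch of Algorithm 2 assumes that the counts are held in a nested dictionary count[v][c][t], produced by one of the two collection methods described below, and reduces generate-rule to recording a (context, word, translation) triple; a context whose default-translation count is zero is treated as having an infinite ratio.

from collections import defaultdict

def extract_rules(count, theta):
    """Generate selection rules whose in-context count beats the
    context-independent default translation by a factor greater than theta."""
    rules = []
    for v, by_context in count.items():
        totals = defaultdict(float)          # context-independent totals
        for by_t in by_context.values():
            for t, f in by_t.items():
                totals[t] += f
        default = max(totals, key=totals.get)
        rules.append((None, v, default))     # the default itself, no context
        for c, by_t in by_context.items():
            t_star = max(by_t, key=by_t.get)
            if t_star == default:
                continue
            denom = by_t.get(default, 0.0)
            ratio = float("inf") if denom == 0 else by_t[t_star] / denom
            if ratio > theta:
                rules.append((c, v, t_star))
    return rules

Run over the counts of Table 4.3 below with θ = 1.5, this sketch should produce the default rule for pescado plus the three context rules for pez listed in section 4.3.2.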


4.3 Supervised learning from a parallel corpus

This section describes how counts are collected when using the supervised learning method based on word alignments from a parallel corpus, and shows how the threshold can be found by minimising the error rate on a development corpus.

4.3.1 Word alignment

Given a parallel corpus such as that in Table 4.2, the set of word alignments a for a pair of sentences is a set of pairs (i, j), where i is an index to a word in the source-language sentence and j is an index to a word in the target-language sentence. Words in one language may be aligned to zero or more words in the other language. The word alignments used in this thesis are generated by running traditional word-based statistical machine translation models with GIZA++ (Och and Ney, 2003) and extracting the Viterbi alignment, or most probable alignment according to the translation models. The aligner is run in both translation directions and then the alignments are symmetrised using the grow-diag-final-and heuristic as implemented in Moses (Koehn et al., 2007).6

4.3.2 Training

The parallel corpus consists of a collection of samples, G = (S, T, a), where S = (s1, s2, …, s|S|)7 is a sequence of source-language words, T = (t1, t2, …, t|T|) is a sequence of target-language words, and a ⊆ [1, |S|] × [1, |T|] is the set of alignment pairs between the words. Table 4.2 presents an example of five possible samples. Along with the parallel corpus, we have a function Ts(si) (see section 2.2.1) which returns, for source-language word si, the set of possible target-language translations {t1, t2, …, tn} as found in the bilingual dictionary of the machine-translation system; see the example in Figure 4.1.

Given this information, and a variable n indicating the maximum context size, we use Algorithm 3 to collect counts. We first check each sample in the corpus against two conditions:

• |Ts(si)| > 1: the sample must contain at least one ambiguous word.

• ∃!j ∈ [1, |T|] : (i, j) ∈ a ∧ tj ∈ Ts(si): there is exactly one translation tj which is found aligned to si and which is also in the set of translations Ts(si) returned by the MT system.

Providing these conditions hold, we continue to process the sample. We first add the ambiguous word to the set of ambiguous words V; then, for each context n-gram c, we add the context to the set of contexts C and update the context counts accordingly.

6 The exact configuration of GIZA++ used is equivalent to running the Moses toolkit in its default configuration up to step three of training.

7 We use the set-cardinality notation here to indicate the size of the sequence.


ID   Sentence
1    S: Arrain1 handiak2 txikia3 jaten4 du5
     T: El1 pez2 grande3 se4 come5 el6 pequeño7
     a: (1, 2) (2, 3) (3, 7) (4, 5)
2    S: Aitak1 arraina2 prestatu3 digu4 afaltzeko5
     T: Papá1 nos2 ha3 preparado4 pescado5 para6 cenar7
     a: (1, 1) (2, 5) (3, 4) (4, 3) (5, 7)
3    S: Hemen1 arraina2 oso3 goxoa4 da5
     T: Aquí1 el2 pescado3 es4 muy5 rico6
     a: (1, 1) (2, 3) (3, 5) (4, 6) (5, 4)
4    S: Itsasoko1 arrain2 handi3 bat4 da5
     T: Es1 un2 pescado3 grande4 del5 mar6
     a: (1, 6) (2, 3) (3, 4) (4, 2) (5, 1)
5    S: Arrain1 handiak2 horiez3 baliatzen4 ziren5 elikatzeko6
     T: Los1 peces2 grandes3 se4 alimentaban5 de6 esos7
     a: (1, 2) (2, 3) (3, 7) (6, 5)

Table 4.2: A small example of a parallel corpus for the Basque–Spanish pair. S is the sequence of words in the source language, Basque; T is the sequence of words in the target language, Spanish; and a is the set of alignment pairs between the words.

s        Hemen    arrain           -a    oso    goxo                   -a    ukan
Ts(si)   {aquí}   {pez, pescado}   {el}  {muy}  {rico, dulce, suave}   {el}  {ser}

s        Itsaso   -ko   arrain           handi             bat   izan
Ts(si)   {mar}    {de}  {pez, pescado}   {grande, capaz}   {un}  {ser}

Figure 4.1: Example of the output of the lexical-transfer module of our Basque–Spanish machine translation system for two SL sentences from our parallel corpus in Table 4.2. Before being passed to the lexical-transfer module, the input is tokenised, lemmatised and tagged for part of speech. The presence of a hyphen in front of a lemma (for example -ko) indicates that it is a clitic, which in the orthography is attached to the previous word. Where a word is underlined, it indicates that this is an appropriate selection in the given context.


These counts are then used in calculating the rule-inclusion threshold, and for generating the rules as described in section 4.2.

Algorithm 3 collect-counts-parallel: Algorithm to collect the n-gram counts from a parallel corpus.
Require: G, n: collection of samples, maximum context size
 1: for all (S, T, a) ∈ G do
 2:   for all i ∈ [1, |S|] do
 3:     if |Ts(si)| > 1 ∧ ∃!j ∈ [1, |T|] : (i, j) ∈ a ∧ tj ∈ Ts(si) then
 4:       V ← V ∪ {si}
 5:       for all c ∈ ngrams(s, i, n) do
 6:         C ← C ∪ {c}
 7:         count(c, si, tj) ← count(c, si, tj) + 1
 8:       end for
 9:     end if
10:   end for
11: end for
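A Python rendering of Algorithm 3, assuming that each sample is a triple (S, T, a) of two word lists and a set of 1-based alignment pairs, that Ts maps an SL word to its set of dictionary translations, and reusing the ngrams helper sketched in section 4.2:

from collections import defaultdict

def collect_counts_parallel(samples, Ts, n, ngrams):
    """count[s][c][t]: times SL word s, in context c, was aligned to t."""
    count = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for S, T, a in samples:
        for i in range(1, len(S) + 1):
            s_i = S[i - 1]
            if len(Ts(s_i)) <= 1:
                continue                      # not ambiguous: skip
            # translations aligned to s_i that are also in the dictionary
            aligned = {T[j - 1] for (k, j) in a if k == i} & Ts(s_i)
            if len(aligned) != 1:
                continue                      # require exactly one
            t_j = next(iter(aligned))
            for c in ngrams(S, i - 1, n):
                count[s_i][c][t_j] += 1
    return count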

Suppose we have the parallel corpus in Table 4.2 and are translating from Basque to Spanish. Table 4.3 presents the information we would have at the end of this algorithm: each source-language word si, along with its set of n-gram contexts and their counts. Note that there may also be no context, which gives us the count of the most-often-aligned translation (MOAT; the last row in Table 4.3). If we imagine that our search for the most adequate threshold gives us a value of θ = 1.5, then in this case four rules would be generated:

• ‘pescado’ for arrain with no context

• ‘pez’ for arrain in the context (_, handi)

• ‘pez’ for arrain in the context (_, handi, -a)

• ‘pez’ for arrain in the context (_, handi, -a, -k)

The first rule is the most-often-aligned translation, and the others are exceptions to it in fixed contexts. Note that here we assume a threshold; in fact, different thresholds generate different rule sets. The next section describes how we find the most adequate threshold for a given language pair and corpus.

4.3.3 Finding the rule-inclusion threshold

The possible values for θ can be extracted from the corpus by calculating equation 4.1 for each rule (c, v, t∗):


si       c                               count(c, si, ‘pez’)   count(c, si, ‘pescado’)
arrain   arrain handi -a -k              2                     0
         arrain handi -a                 2                     0
         arrain handi                    2                     1
         -ko arrain handi                0                     1
         -ko arrain                      0                     1
         -k arrain -a                    0                     1
         -k arrain                       0                     1
         itsaso -ko arrain handi bat     0                     1
         itsaso -ko arrain               0                     1
         hemen arrain -a                 0                     1
         hemen arrain                    0                     1
         arrain handi bat izan           0                     1
         arrain handi bat                0                     1
         arrain -a prestatu ukan         0                     1
         arrain -a prestatu              0                     1
         arrain -a oso goxo              0                     1
         arrain -a oso                   0                     1
         aita -k arrain -a prestatu      0                     1
         aita -k arrain                  0                     1
         arrain -a                       0                     2
         arrain                          2                     3

Table 4.3: The n-gram counts for the ambiguous word arrain ‘pez, pescado’, as collected from the corpus in Table 4.2. The most-often-aligned translation, obtained with no context, is arrain → ‘pescado’.


Pair     Range           Values   θ
br-fr    [1.1, 30.0]     33       3.0
mk-en    [1.1, 170.0]    110      2.0
eu-es    [1.1, 176.0]    256      3.2
en-es    [1.1, 141.0]    323      2.5

Table 4.4: Searching for the optimal value of θ. For each of the possible values, we perform an exhaustive search within the range of values extracted from the corpus, and select the value with the lowest error rate. In the case that several values have the same error rate, we choose the lowest value. A θ of 3.0 means that a translation which is not the most-often-aligned translation must be found three times as often as the most-often-aligned translation in a given context in order to create a rule.

ξ = count(c, v, t∗) / count(c, v, ζ∗)    (4.1)

Given the set of possible values, we run Algorithm 2 (extract-rules) for each of the values to generate the possible rule sets. We then translate the development corpus with each of the rule sets, and calculate the lexical-selection error rate. Figure 4.2 shows the change in LER as the threshold increases for one language pair. One possibility for finding the threshold would be to use a typical line-search algorithm such as ternary search, but given the low number of values actually found for each of the language pairs, and the fact that the function is not continuous, we have performed an exhaustive search.
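The exhaustive search itself is straightforward. In this sketch, ler is a hypothetical stand-in for translating the development corpus with a rule set and computing its lexical-selection error rate, and extract_rules is the sketch of Algorithm 2 from section 4.2:

def find_threshold(count, candidate_thetas, extract_rules, ler):
    """Exhaustive search: return the candidate theta whose rule set gives
    the lowest error rate; ties are broken in favour of the lowest theta."""
    best_theta, best_err = None, float("inf")
    for theta in sorted(candidate_thetas):
        err = ler(extract_rules(count, theta))
        if err < best_err:        # strict '<' keeps the lowest theta on ties
            best_theta, best_err = theta, err
    return best_theta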

Table 4.4 gives the optimal values of θ found for each of the language pairs.

4.4 Unsupervised learning from monolingual corpora

We have seen in the previous section how a parallel corpus may be used to collect counts of contexts and translations in order to generate lexical-selection rules. However, as we comment in the overview (section 4.1), parallel corpora are not available for the majority of the world's written languages. In this section we describe an unsupervised method to learn rules from monolingual corpora.

The input to our method consists of a collection of samples, G = (S, G), where S = (s1, s2, …, s|S|) is a sequence of source-language words, and G = {g1, g2, …, g|G|} is a set of possible lexical-selection paths. A lexical-selection path g = (t1, t2, …, t|S|) is a sequence of lexical-selection choices for those source-language words.


[Two graphs: lexical-selection error rate (LER, %) as a function of the rule-inclusion threshold θ, with curves for the rules and for the most-frequently-aligned translation.]

Figure 4.2: Graphs showing the evolution of the lexical-selection error rate for the English–Spanish pair on the development corpus as the rule-inclusion threshold increases. The number of possible values of θ is small enough (fewer than 400) to be able to carry out an exhaustive search to find the optimal value. In this case, the optimal value of θ is between 2.5 and 2.8, yielding a LER of 7.1% compared to 8.2% using the most-often-aligned translation. The second graph is a magnification of the first part of the first graph.


We also have a function τ(gi, S) which returns a complete translation (after transfer and generation) of the lexical-selection path gi of sentence S.

Following our previous example for the parallel corpus, Table 4.5 presents an example of five possible samples. The value of the fractional count p(g|S) is computed by scoring the translation produced for each lexical-selection path on a target-language model PTL and normalising the result:

p(gi|S) = PTL(τ(gi, S)) / Σ_{gi∈G} PTL(τ(gi, S))    (4.2)
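Equation 4.2 amounts to a one-line normalisation. In the sketch below, translate stands for τ(·, S) and tl_score for PTL; both are hypothetical stand-ins for the actual MT pipeline and target-language model:

def fractional_counts(paths, translate, tl_score):
    """Normalise target-language-model scores over all lexical-selection
    paths of one SL sentence so that they sum to one (equation 4.2)."""
    scores = [tl_score(translate(g)) for g in paths]
    total = sum(scores)
    return [s / total for s in scores]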

Algorithm 4 collect-counts-mono: Algorithm to collect the n-gram counts from monolingual corpora.
Require: G, n: collection of samples, maximum context size
 1: for all (S, G) ∈ G do
 2:   for all i ∈ [1, |S|] do
 3:     if |Ts(si)| > 1 then
 4:       V ← V ∪ {si}
 5:       for all c ∈ ngrams(s, i, n) do
 6:         C ← C ∪ {c}
 7:         for all t ∈ Ts(si) do
 8:           for all gi ∈ G do
 9:             for all gij ∈ gi do
10:               if t = gij then
11:                 count(c, si, t) ← count(c, si, t) + p(gi|S)
12:               end if
13:             end for
14:           end for
15:         end for
16:       end for
17:     end if
18:   end for
19: end for

Given this information, Algorithm 4 proceeds much in the same way as the supervised method. We iterate through all of the samples, ensuring that in each sample there is at least one ambiguous word. The main difference is that instead of adding one each time we see an aligned translation, here we add the fractional count as calculated from the language model. Table 4.6 shows the result of applying this process to the example in Table 4.5. The result of generating rules from these counts, supposing we use the same value for θ as with the supervised learning method, would be the same as for the supervised method.
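A Python rendering of Algorithm 4, under the same assumptions as the supervised sketch but with each sample now a triple of the SL sentence, its lexical-selection paths and their fractional counts p(gi|S):

from collections import defaultdict

def collect_counts_mono(samples, Ts, n, ngrams):
    """count[s][c][t]: fractional count of translation t for SL word s
    in context c, summed over all lexical-selection paths."""
    count = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for S, paths, probs in samples:
        for i in range(1, len(S) + 1):
            s_i = S[i - 1]
            if len(Ts(s_i)) <= 1:
                continue
            for c in ngrams(S, i - 1, n):
                for g, p in zip(paths, probs):
                    # g[i-1] is the translation this path chose for s_i;
                    # it contributes its share p(g|S) instead of a full 1
                    count[s_i][c][g[i - 1]] += p
    return count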


S    Sentence                                                   p(gi|S)
S1   Arrain handiak txikia jaten du
     τ(g1, S): El pez grande se come el pequeño                 0.830
     τ(g2, S): El pescado grande se come el pequeño             0.134
     τ(g3, S): El pescado capaz se come el pequeño              0.034
     τ(g4, S): El pez capaz se come el pequeño                  0.001
S2   Aitak arraina prestatu digu afaltzeko
     τ(g1, S): Padre nos ha preparado pescado para cenar        0.846
     τ(g2, S): Papá nos ha preparado pescado para cenar         0.134
     τ(g3, S): Padre nos ha preparado pez para cenar            0.015
     τ(g4, S): Papá nos ha preparado pez para cenar             0.005
S3   Hemen arraina oso goxoa da
     τ(g1, S): Aquí el pescado es muy rico                      0.912
     τ(g2, S): Aquí el pescado es muy suave                     0.067
     τ(g3, S): Aquí el pez es muy rico                          0.011
     τ(g4, S): Aquí el pez es muy suave                         0.008
     τ(g5, S): Aquí el pez es muy dulce                         0.001
     τ(g6, S): Aquí el pescado es muy dulce                     0.001
S4   Itsasoko arrain handi bat da
     τ(g1, S): Es un pez grande del mar                         0.595
     τ(g2, S): Es un pescado grande del mar                     0.403
     τ(g3, S): Es un pez capaz del mar                          0.001
     τ(g4, S): Es un pescado capaz del mar                      0.001
S5   Arrain handiak horiez baliatzen ziren elikatzeko
     τ(g1, S): Los peces grandes se nutrían de esos             0.862
     τ(g2, S): Los pescados grandes se nutrían de esos          0.121
     τ(g3, S): Los peces grandes se alimentaban de esos         0.012
     τ(g4, S): Los peces capaces se alimentaban de esos         0.001
     τ(g5, S): Los pescados grandes se alimentaban de esos      0.001
     τ(g6, S): Los pescados capaces se alimentaban de esos      0.001
     τ(g7, S): Los peces capaces se nutrían de esos             0.001
     τ(g8, S): Los pescados capaces se nutrían de esos          0.001

Table 4.5: The Basque–Spanish monolingual corpus. The table presents the source-language sentences and the output of the possible lexical-selection paths after scoring on a target-language model. The scores are normalised as fractional counts.


si       c                               count(c, si, ‘pez’)   count(c, si, ‘pescado’)
arrain   arrain handi -a -k              1.707                 0.292
         arrain handi -a                 1.707                 0.292
         arrain handi                    2.303                 0.696
         -ko arrain handi                0.596                 0.404
         -ko arrain                      0.596                 0.404
         -k arrain -a                    0.020                 0.980
         -k arrain                       0.020                 0.980
         itsaso -ko arrain handi bat     0.596                 0.404
         itsaso -ko arrain               0.596                 0.404
         hemen arrain -a                 0.020                 0.980
         hemen arrain                    0.020                 0.980
         arrain handi bat izan           0.596                 0.404
         arrain handi bat                0.596                 0.404
         arrain -a prestatu ukan         0.020                 0.980
         arrain -a prestatu              0.020                 0.980
         arrain -a oso goxo              0.020                 0.980
         arrain -a oso                   0.020                 0.980
         aita -k arrain -a prestatu      0.020                 0.980
         aita -k arrain                  0.020                 0.980
         arrain -a                       0.040                 1.960
         arrain                          2.343                 2.656

Table 4.6: The fractional counts for the ambiguous word arrain ‘pez, pescado’, as collected from the corpus in Table 4.5. The most frequent translation is arrain → ‘pescado’.

The most-likely translation would be pescado, and three context rules would be generated for the alternative translation pez.

4.4.1 Finding the rule-inclusion threshold

Finding the threshold for the unsupervised learning method can be done in the same way as for the supervised learning method, using an automatically annotated development corpus. However, when calculating the inclusion threshold θ for the rules learnt using the fractional-count method, there is the problem of having too many values of θ (68,053 in the case of the English–Spanish pair) to be able to perform an exhaustive search. This is due to the fact that rules with low frequency and summed partial counts below 1.0 return very high values of ξ. One possibility for finding the best θ would be to use a typical search algorithm such as ternary search. However, experimentally we found that this was not adequate. The problems are actually twofold.


The first problem is that, for a large part of the search space, as we decrease the threshold the number of rules increases, but no rules are applied; this means that they have no effect when we calculate the error rate on the development corpus. This plateau (which can be seen on the right-hand side of Figure 4.3) represents the vast majority of the search space of θ. The rules which are generated in this area, those with a very high θ, are typically those which have a low frequency in the training corpus, and which are therefore unlikely to be applied in the development corpus. So, for high values of θ, we are adding very few, low-frequency rules, which might be reliable, but are not being applied. To avoid the problem of the search algorithm getting lost in the plateau, we opt for sampling the space geometrically in steps of 2^(1/8). Searching in this way may lead us to miss the best value for θ, but it allows us to find a better value than other methods. The step of 2^(1/8) was chosen as we found the results to be indistinguishable from those obtained with a smaller step.
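A sketch of the geometric sampling, where lo and hi are assumed to be the smallest and largest candidate values of θ extracted from the corpus:

def geometric_thetas(lo, hi, step=2 ** (1 / 8)):
    """Yield candidate thresholds lo, lo*step, lo*step**2, ... up to hi."""
    theta = lo
    while theta <= hi:
        yield theta
        theta *= step

Each sampled value is then evaluated on the development corpus exactly as in the supervised case.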

The reader will note that this means that we cannot assume a completely monolingual training environment: given no parallel corpus, it is not possible to tune against an annotated reference translation. Thus, we need another metric for indicating translation improvement or worsening. We experimented with using the product of the LM probabilities of the sentences translated from the development set as an indicator of translation improvement. Unfortunately this was not fruitful, and so at present, even though the training is entirely unsupervised, finding the threshold requires an annotated development corpus.

4.5 Experiments

Reference results

We compare our methods against the following reference systems:

• Linguistic defaults. As described in section 2.4, this is the translation that is perceived as the most frequent or the most general; it is marked as the default by the linguist who is creating the bilingual dictionary.

• Hand-written rules. To compare the automatic methods of rule-writing with humans, we reused the rules described in section 3.4.

• Target-language model. One method of lexical selection is to use the existing machine translation system to generate all the possible translations of an input sentence, and then score these translations on-line on a model of the target language. The highest-scoring sentence is then picked as output. This is the method used by Carbonell et al. (2006) and Melero et al. (2007).


[Graph: lexical-selection error rate (LER, %) as a function of the rule-inclusion threshold log(θ), with curves for the rules and for the most-likely translation.]

Figure 4.3: Graph showing the evolution of the lexical-selection error rate for the English–Spanish pair on the held-out development corpus as a function of the rule-inclusion threshold. The number of possible values of θ is too high to be able to carry out an exhaustive search to find the optimal value; we sample in steps of 2^(1/8). In the case of this pair, the optimal value of θ is between 200,000 and 600,000, yielding a LER of 10.8% compared to 11.0% using the most-likely translation. In this case the improvement using rules with respect to using the most-likely translation is not significant at p = 0.95.


The TLM system is a five-gram language model of surface forms generated from the target-language side of the parallel corpus (see section 2.3). This was the best-performing system when we compared different approaches, but it is impractical for real-world MT because the number of translations to perform grows exponentially with the length of the sentence. It represents the best that can be achieved using the current systems without access to a parallel corpus. It was implemented with the IRST language-modelling toolkit (Federico et al., 2008).

• Most-often-aligned. This is the result achieved by choosing the most-often-aligned translation in the parallel corpus. It will be referred to with the abbreviation MOAT.

Table 4.7 shows the LER and BLEU reference results for each of the four language pairs and each of the four systems. The results from the linguistic defaults can be considered the baseline for any method to beat, using the knowledge currently in the system. The TLM system can be considered the best that can be achieved without access to a parallel corpus. Choosing the most-often-aligned translation may be regarded as the best that can be achieved while having access to a parallel corpus but not taking translation context into account. In addition to these four reference systems, we also compare against the following system:

• Most-likely. These can be thought of as the default translations as collected by summing up the fractional counts for each of the translations independently of context. This is very similar to the method described in Koehn and Knight (2000). It is different from the TLM system in that the counts are averaged over occurrences of the same context, and context is not taken into account at selection time, only at training time.

This is a novel method, and so will not be referred to as a reference system; but as it does not directly incorporate context at lexical-selection time, it will be considered apart from the other novel methods.

4.6 Results

In terms of training data, we see that very little parallel data is needed to improve performance over the linguistic defaults (Figures 4.4, 4.5, 4.6, 4.7) and the target-language model (TLM). For all language pairs except Macedonian–English (Figure 4.6), around 100 sentences are sufficient to outperform the linguistic defaults, and around 1,000 to outperform the TLM. However, the reader will recall that the Macedonian–English linguistic defaults were made with reference to the corpus used for both training and testing.


Pair     Metric      System
                     Ling            Hand            TLM             MOAT
br-fr    LER (%)     [54.8, 60.7]    [54.3, 60.2]    [44.2, 50.5]    [29.7, 35.1]
         BLEU (%)    [14.5, 16.4]    [14.6, 16.5]    [15.1, 17.0]    [15.0, 16.9]
mk-en    LER (%)     [28.8, 32.6]    [26.3, 30.0]    [26.8, 30.5]    [19.0, 22.2]
         BLEU (%)    [28.6, 31.0]    [28.7, 31.1]    [30.7, 32.3]    [29.9, 32.3]
eu-es    LER (%)     [43.6, 48.8]    [43.4, 48.5]    [38.8, 44.2]    [16.5, 20.8]
         BLEU (%)    [10.1, 12.0]    [10.1, 12.0]    [10.6, 12.6]    [11.1, 13.1]
en-es    LER (%)     [20.4, 24.7]    [20.4, 24.7]    [15.1, 18.9]    [7.2, 10.0]
         BLEU (%)    [21.6, 23.5]    [21.6, 23.5]    [21.9, 23.8]    [22.1, 24.0]

Table 4.7: LER and BLEU scores with 95% confidence intervals for the reference systems on the test data. With the exception of mk-en, TLM outperforms Ling with 95% confidence. MOAT, the system selecting defaults from the parallel corpus, outperforms TLM in all cases.

Context rules start being created and applied at around 500–600 sentences of training data; however, to start with they tend to cause an increase in error rate. This starts to decrease by 2,000–4,000 sentences. In all cases, with the exception of Breton–French, where there was not sufficient data, the improvement starts to level out after around 10,000 sentences.

In comparison, for the monolingual training method (Figures 4.8, 4.9, 4.10, 4.11), around ten times as much data is needed, with TLM performance only being achieved or exceeded at around 10,000 sentences. The exception is Breton–French (Figure 4.11), where TLM-like performance is achieved after only 2,448 sentences, the maximum size of the training corpus.8 Only in the pairs with substantially more data (Basque–Spanish and English–Spanish) do we see the error rate start to level off with the monolingual training method.

The graphs showing the evolution of coverage (the second of the two graphs in Figures 4.8, 4.9, 4.10, 4.11) show that rules incorporating context are rarely applied. Table 4.8 summarises this information and gives details of the number of rules created for each language pair and training method. The parallel rules, although numerically fewer, are consistently applied more often than the rules learnt monolingually. If we take the difference in lexical-selection error rate between the most-often-aligned or most-likely translation and the context rules, together with the number of rules applied, we can calculate the success rate of the rules: that is, how often a rule which is called changes the translation to the one that is found in the reference. This is similar to the naïve accuracy in Chapter 3.

8 Although monolingual corpora are easier to find, the intention is to present comparable results for each of the combinations of language pair and corpus.


Pair     Parallel                         Monolingual
         Rules   Coverage   Success       Rules   Coverage   Success
br-fr    18      3.48%      89%           158     0.43%      48%
mk-en    454     1.97%      30%           6733    1.87%      11%
eu-es    414     3.34%      20%           336     0.00%      –
en-es    944     2.40%      37%           6584    0.27%      19%

Table 4.8: Number of context rules extracted for the two training methods. The Coverage column shows the coverage of the rules; for example, there are 1,181 ambiguous words in the Breton–French test set and context rules are called 41 times, so the coverage is 3.48%.

It is interesting to note that, for the parallel rules, the percentage of rules applied and the improvement in error rate are similar to the best case in the evaluation of hand-written rules (see Chapter 3).

4.6.1 Comparison with reference systems

The comparison between the reference systems and the parallel-corpus learning method is summarised in Figures 4.12, 4.13, 4.14 and 4.15. For the monolingual learning method, the results are summarised in Figures 4.16, 4.17, 4.18 and 4.19. The differences in performance are small, and the confidence intervals overlap for each of the metrics. For the BLEU measure, they also overlap in most cases with the Oracle system (see section 2.4). Given this, to determine whether the improvements were statistically significant, we performed paired bootstrap resampling for both measures between the systems. For the LER and BLEU metrics, comparing all of the methods against the linguistic defaults, there is a statistically significant (p = 0.95) improvement in performance, with the exception of BLEU for Macedonian–English.

For the rule sets learnt monolingually, the improvement in both metrics is statistically significant (p = 0.95) compared to the target-language model for the English–Spanish and Macedonian–English pairs. For the Breton–French pair, the improvement in LER is small, but significant.

It is worth noting that, although the monolingual learning method does not give a statistically significant (p = 0.95) improvement over using the TLM in all cases, it does give an improvement over using the linguistic defaults, and it also approximates the TLM performance, using only source-language context information.

4.7 Discussion

In this chapter we have presented a general method for learning the lexical-selection rules described in the previous chapter. The method can take


[Two graphs: lexical-selection error rate (LER, %) and coverage (%) as a function of the number of parallel training sentences, with curves for the context rules and the most-often-aligned translation, and horizontal lines for the linguistic defaults and the TLM.]

Figure 4.4: English–Spanish. The first graph shows how the error rate reduces as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the target-language model. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-often-aligned translation.


[Two graphs: lexical-selection error rate (LER, %) and coverage (%) as a function of the number of parallel training sentences, with curves for the context rules and the most-often-aligned translation, and horizontal lines for the linguistic defaults and the TLM.]

Figure 4.5: Basque–Spanish. The first graph shows how the error rate reduces as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the target-language model. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-often-aligned translation.

Figure 4.6: Macedonian–English. The first graph shows how the error rate reduces as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the target-language model best. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and the most-often-aligned translation.

Figure 4.7: Breton–French. The first graph shows how the error rate reduces as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the target-language model best. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and the most-often-aligned translation.

Figure 4.8: English–Spanish. The first graph shows the error rate reducing as more monolingual training data is added. The two curves show the effect of using only rules based on selecting the most-likely translation, calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and the most-likely translation.

Figure 4.9: Basque–Spanish. The first graph shows the error rate reducing as more monolingual training data is added. The two curves show the effect of using only rules based on selecting the most-likely translation, calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and the most-likely translation.

Figure 4.10: Macedonian–English. The first graph shows the error rate reducing as more monolingual training data is added. The two curves show the effect of using only rules based on selecting the most-likely translation, calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and the most-likely translation.

Figure 4.11: Breton–French. The first graph shows the error rate reducing as more monolingual training data is added. The two curves show the effect of using only rules based on selecting the most-likely translation, calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and the most-likely translation.

Figure 4.12: Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the English–Spanish pair. This can be seen as a summary of Figure 4.4. The error bars show the 0.95 confidence intervals.

Figure 4.13: Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the Basque–Spanish pair. This can be seen as a summary of Figure 4.5. The error bars show the 0.95 confidence intervals.

Figure 4.14: Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the Macedonian–English pair. This can be seen as a summary of Figure 4.6. The error bars show the 0.95 confidence intervals.

Figure 4.15: Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the Breton–French pair. This can be seen as a summary of Figure 4.7. The error bars show the 0.95 confidence intervals.

Figure 4.16: Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the English–Spanish pair. This can be seen as a summary of Figure 4.8. The error bars show the 0.95 confidence intervals. Note the difference between the TLM and TLM-Rules systems.

Figure 4.17: Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the Basque–Spanish pair. This can be seen as a summary of Figure 4.9. The error bars show the 0.95 confidence intervals.

Figure 4.18: Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the Macedonian–English pair. This can be seen as a summary of Figure 4.10. The error bars show the 0.95 confidence intervals.

Figure 4.19: Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the Breton–French pair. This can be seen as a summary of Figure 4.11. The error bars show the 0.95 confidence intervals.

The method can take advantage of both parallel corpora and monolingual corpora. In the case that a parallel corpus is available, it works by looking at the difference in frequency, in word-aligned context, between the most-often-aligned or most-likely translation and an alternative. In the case that a parallel corpus is not available, a novel method relying on fractional counts has been presented which, based only on source-language context, performs as well as or better than a target-language model.

We have shown that the real boost to performance, as opposed to using the linguistic defaults, comes from changing the default translation to suit the corpus, in both the parallel and the monolingual learning strategies; the inclusion of context accounts for under 5% of the rules applied, and in the case of monolingual rules under 2%. This would seem to confirm the Yarowskian observation that there is a strong one-sense-per-discourse effect (Gale et al., 1991). It is interesting to note that the greatest improvement was found in the Breton–French pair, which, although it has the smallest amount of data, also has the most diverse corpus, with texts from various domains. As a practical matter, this suggests that choosing the default translation manually is not really necessary, as it can easily be learnt. This could remove a lot of the work associated with creating a new language pair, and it could also provide a kind of instant tuning to a particular domain or task.

However, the fact that, depending on the language pair, 10–30% of errors remain compared to the Oracle suggests that we are either missing useful information for disambiguation, or including non-useful or harmful information. The next chapter presents a method of weighting the rules we learn in the maximum-entropy framework, and of altering the rule-application process to attempt to take advantage of information in the corpus that we may be discarding.


Chapter 5

Weighting

As we have seen in the previous chapter, it is possible to learn lexical-selection rules in order to improve translation quality. However, there is a strong bias towards the most frequent translation: in many cases, choosing the most frequent translation performs almost as well as using the rules. One reason for this could be that we are discarding useful data at two stages. The first is in the rule-extraction process, when we enforce the stringent rule-inclusion threshold; the second is when we apply the rules to an input sentence, choosing the best coverage. On the one hand, the threshold for selecting a good context rule — or set of context rules — can be as high as 3.5 times the frequency of the most-often-aligned or most-likely translation in that context. This means that we are throwing away data which show that a translation is two or three times as frequent. On the other hand, when we apply the rules to the input sentence, by only picking the best coverage, that is, the sequence of rules incorporating the greatest amount of context, we may be throwing away rules with a higher ratio but incorporating less context.

To improve translation quality above that which is achievable with the rules learnt in the previous chapter we need to overcome both of these restrictions: we need to avoid discarding useful context both at the learning stage and at the rule-application stage.

With the threshold we described in the previous chapter, we were using a crude method to find the most probable translation in context. Our assumption was that, given no strong evidence to the contrary, we should pick the most frequent¹ translation found in the corpus. In order to choose a translation which was not the most frequent, we imposed the restriction that it be a given number of times more frequent in a fixed context than the most-frequent translation. This was formalised as a threshold θ on the ratio of the alternative to the most-frequent translation.

¹Here, most frequent refers either to the most-often-aligned translation as calculated from the word alignments of the parallel corpus, or to the most-likely translation as calculated by summing the normalised probabilities from the set of fractional corpora in the monolingual learning method.


In this chapter we present a well-formed probability model for finding the most probable translation, based on the principle of maximum entropy, which addresses the two limitations of the previous chapter.
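As a minimal sketch of the rules-and-threshold criterion of Chapter 4 (the counts are hypothetical), a context rule selecting an alternative translation is kept only if the alternative is at least θ times as frequent as the default translation in that context:

    def keep_rule(count_alternative, count_default, theta):
        # Rule-inclusion test: the alternative translation must be at
        # least theta times as frequent as the default in this context.
        if count_default == 0:
            return count_alternative > 0
        return count_alternative / count_default >= theta

    # Hypothetical counts for one source-language context:
    print(keep_rule(14, 5, theta=3.5))  # False: ratio 2.8 is below the threshold
    print(keep_rule(21, 5, theta=3.5))  # True: ratio 4.2 passes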

5.1 Maximum-entropy lexical selection

Let the probability of a translation t being the translation of a word s in an SL context c be p_s(t|c). In principle this value could be calculated directly from the available corpora for every combination of (s, t, c). This, however, presents two questions: first, how should the relevant contexts be chosen? And second, what should be done with the translations of words which are not found in the corpus? A maximum-entropy model answers both of these questions. It allows the contexts that we consider to be linguistically interesting to be defined a priori and then integrates them seamlessly into a probabilistic model (Manning and Schütze, 1999). In answer to the second question, a maximum-entropy model assumes nothing about what is not in the training data: if there is no information in the training data, it assumes that all outcomes are equally likely. The principle of maximum entropy has been applied to the problem of lexical selection before. Berger et al. (1996) cast the problem of lexical selection in statistical MT as a classification problem. They learn a separate maximum-entropy classifier for each SL word, using SL context to distinguish between possible translations. These classifiers are then incorporated into the translation model of their word-based SMT system.

In their approach, a classifier consists of a set of binary feature functions and corresponding weights for each feature. Features are defined in the form h_s(t, c)², where t is a translation and c is an SL context. A feature where pez is seen as the translation of arrain in the context arrain handi 'big fish' would therefore be defined as:

h_{arrain}(t, c) = \begin{cases} 1 & \text{if } t = \textit{pez} \text{ and } \textit{handi} \text{ follows } \textit{arrain} \\ 0 & \text{otherwise} \end{cases} \quad (5.1)
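Such a binary feature could be sketched in code as follows (the dictionary-based context encoding and the helper names are hypothetical; this is not the implementation used in apertium-lex-tools):

    def make_feature(translation, context_test):
        # Binary feature h_s(t, c): fires only when t equals `translation`
        # and the SL context satisfies `context_test`.
        def h(t, c):
            return 1 if t == translation and context_test(c) else 0
        return h

    # The feature from equation 5.1: pez as the translation of arrain
    # when handi immediately follows arrain.
    h_arrain = make_feature("pez", lambda c: c.get("next_word") == "handi")
    print(h_arrain("pez", {"next_word": "handi"}))      # 1
    print(h_arrain("pescado", {"next_word": "handi"}))  # 0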

During the training process, each feature h_k^s in the classifier is assigned a weight λ_k^s, and combining the weights of the active features as in equation 5.2 yields the probability of a translation t for word s in context c.

p_s(t \mid c) = \frac{1}{Z} \exp \sum_{k=1}^{n_F} \lambda_k^s h_k^s(t, c) \quad (5.2)

In this equation, Z is a normalising constant. Thus, the most probable translation can be found using equation 5.3.

²The exact notation they use is f(x, y), where y is a French word and x is an English context. We include the source word s in each feature for clarity.


[Figure 5.1 graphic: the sentence Itsaso -a -n arrain handi -e -k igeri egiten dute, annotated with the weights of the ten active rules for the translations of arrain (pez, pescado) and handi (grande, capaz); see the caption for the weight sums.]

Figure 5.1: The maximum-entropy rule-application process. The features used are the contexts from the rules learnt in the previous chapter. The weights for each translation are summed, and when a position in the sentence is reached where q0 is the only alive state, the translation with the highest combined weight is picked. For the word arrain, the translation pez (0.8 + 0.9 + 1.7 + 1.8 = 5.2) is chosen over pescado (1.3 + 1.3 + 1.0 + 1.2 = 4.8), and for the word handi, the translation grande (1.2) is chosen over capaz (0.4).

\operatorname*{argmax}_{t \in T(s)} p_s(t \mid c) = \operatorname*{argmax}_{t \in T(s)} \sum_{k=1}^{n_F} \lambda_k^s h_k^s(t, c) \quad (5.3)

The features they define are similar to a subset of the lexical-selection rules we described in Chapter 3.

5.1.1 Rule application

In Chapter 3 we defined a lexical-selection rule as r = (c, U), where c is a sequence of SL patterns and U is a sequence of operations. Here we redefine a rule as r = (c, U, λ), where λ is a weight, and introduce an additional restriction: the rule may only contain a single select operation. This allows us to treat every rule as a binary feature function h_s(t, c).

In order to apply the rules to the input sentence, i.e. to see which features are active and to compute the probability p_s(t|c) for all active features, we modify the best-coverage algorithm introduced in Chapter 3. Instead of choosing the longest rules, we add up, for each target-language translation of each source-language word, the weights of the rules that are active (refer to equation 5.2). Once we have reached a position in the sentence where the only alive state in the transducer is the initial state q0, we backtrack and select the translation with the highest sum of weights. Figure 5.1 gives an example of this process for the Basque sentence Itsasoan arrain handiek igeri egiten dute, 'The big fish swim in the sea'. The ambiguous words are arrain 'pez, pescado' and handi 'grande, capaz'. There are ten rules active, containing five SL contexts: (arrain), (arrain handi), (arrain handi -e), (arrain handi -e -k) and (handi).


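The weight summation can be illustrated with a short sketch that reproduces the sums from Figure 5.1 (the per-rule weights are taken from the figure caption; the data structures are illustrative only):

    from collections import defaultdict

    # Weights of the active rules for each (SL word, TL translation) pair,
    # as in Figure 5.1.
    active_rule_weights = {
        ("arrain", "pez"):     [0.8, 0.9, 1.7, 1.8],
        ("arrain", "pescado"): [1.3, 1.3, 1.0, 1.2],
        ("handi", "grande"):   [1.2],
        ("handi", "capaz"):    [0.4],
    }

    totals = defaultdict(dict)
    for (word, translation), weights in active_rule_weights.items():
        totals[word][translation] = sum(weights)

    for word, scores in totals.items():
        print(word, "->", max(scores, key=scores.get))
    # arrain -> pez (5.2 beats 4.8); handi -> grande (1.2 beats 0.4)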

5.2 Experiments

After learning the rule sets and weights, we run the same evaluation as in previous chapters. There is an option to remove features which occur less than a certain number of times in the training corpus. This is referred to as the feature-pruning frequency threshold: features occurring less often than the threshold are discarded. The value was set experimentally. Values of between two and seven were tested, and the ones which provided the best improvement on the development corpus were selected. Given enough data, the rule-of-thumb value of five (Manning and Schütze, 1999, p. 596) was found to be effective. Berger et al. (1996) describe an improvement of the generalised iterative scaling algorithm which works with non-binary features. However, as we are working with binary features, we use the implementation of generalised iterative scaling available in YASMET³ to calculate the feature weights.
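The pruning step itself can be sketched as follows (the feature names and counts are hypothetical):

    from collections import Counter

    def prune_features(feature_counts, threshold):
        # Discard features seen fewer than `threshold` times in training.
        return {f for f, n in feature_counts.items() if n >= threshold}

    counts = Counter({"arrain|handi->pez": 12, "arrain|gorri->pescado": 3})
    print(prune_features(counts, threshold=5))
    # {'arrain|handi->pez'} -- the rare feature is discarded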

5.3 Results

Table 5.1 shows the number of features that are generated for each language pair. Evaluation results are presented in Tables 5.2 and 5.3.

For the parallel training, the maximum-entropy method outperforms the rules-and-threshold method from Chapter 4 for all language pairs and both metrics. This improvement is statistically significant (p = 0.95) for the two pairs with the least data, and also significant for LER for the English–Spanish pair, which has the most data. The difference between the maximum-entropy method and the rules-and-threshold method is substantial in the case of Breton–French, the language pair with the least amount of training data. As regards the monolingual training, the only improvement was seen with the Breton–French pair, again the pair with the least data. However, in all cases the maximum-entropy method comes close to the target-language model performance, and outperforms the linguist-chosen default translations. It is worth noting that, unlike the rules-and-threshold method from the previous chapter, the maximum-entropy method is truly unsupervised: no threshold is necessary, and thus no annotated development corpus is needed to calculate it. In a truly monolingual setting, this would be a good alternative to the target-language model.

As the error rate was so low for the English–Spanish pair, we chose to carry out a differential evaluation compared to the oracle, to see if the error rate was in fact lower; this was the same as the differential evaluation in Chapter 3.

³http://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html; for the exact version used see Appendix B.


Pair     Parallel               Monolingual
         Pruned    # features   Pruned    # features
br-fr    < 2       821          < 5       5,277
mk-en    < 5       3,940        < 7       205,494
eu-es    < 5       9,855        < 7       196,024
en-es    < 5       29,402       < 7       195,605

Table 5.1: Number of features (rules) in each rule set. Given the small size of the Breton–French training set, the pruning frequency was reduced until an improvement was seen on the development set. The large difference in the number of features between the training methods is explained by the fact that the monolingual method generates a feature for every possible translation, whereas the parallel method only generates a feature if the translation has been seen aligned in the corpus.

We manually checked each of the translations output by the lexical-selection module to see if the selections were better or worse than the oracle. We found that there were 88 differences between the output of the maximum-entropy model and the oracle. Of these, in 43 cases the maximum-entropy model chose a translation which was either equally as good as the oracle (but not found in the reference) or better than the oracle. So, in practice, the performance could be better than the error rate suggests.

5.4 Discussion

Using maximum-entropy classifiers with supervised training on the parallel corpus shows, in nearly all cases, a statistically significant (p = 0.95) improvement in lexical-selection performance. In the case of the Breton–French pair the improvement is substantial. This can be explained by the fact that the maximum-entropy model allows us to include much more information than the rules-and-threshold model from the previous chapter. The disadvantage of the method when training monolingually is that substantially more rules are generated, which has an impact on speed. The main advantage is that in the small-data scenario the performance is substantially better than the other methods, both for the supervised training and the unsupervised training.

One open question is why the unsupervised training does not work as well for the other language pairs. One possible explanation is that there may be many different combinations of weights which maximise entropy, and these different combinations could behave differently as classifiers (Berger et al., 1996). Thus, it could be the case that although we obtained a collection of weights which maximise entropy for the training set, these weights are not the best for classification.


Pair     Metric      MOAT           Rules          MaxEnt         Oracle
br-fr    LER (%)     [29.7, 35.1]   [27.6, 33.0]   [18.0, 22.6]   [0.0, 0.0]
         BLEU (%)    [15.0, 16.9]   [15.3, 17.2]   [15.8, 17.7]   [16.7, 18.6]
mk-en    LER (%)     [19.0, 22.2]   [18.5, 21.5]   [16.3, 19.3]   [0.0, 0.0]
         BLEU (%)    [29.9, 32.3]   [30.0, 32.4]   [30.2, 32.6]   [30.9, 33.3]
eu-es    LER (%)     [16.5, 20.8]   [15.9, 20.2]   [15.3, 19.4]   [0.0, 0.0]
         BLEU (%)    [11.1, 13.1]   [11.1, 13.2]   [11.1, 13.2]   [11.5, 13.5]
en-es    LER (%)     [7.2, 10.0]    [6.3, 9.1]     [4.7, 7.1]     [0.0, 0.0]
         BLEU (%)    [22.1, 24.0]   [22.3, 24.2]   [22.4, 24.3]   [22.8, 24.7]

Table 5.2: LER and BLEU scores with 95% confidence intervals for the reference systems on the test corpora. The max-ent system has been trained using a word-aligned parallel corpus. The results in bold face show statistically significant improvements for the maximum-entropy model compared to the best-coverage rules according to pair-bootstrap resampling. The column labelled MOAT is the result of choosing the most-often-aligned translation in the parallel corpus.

Pair     Metric      TLM            Rules          MaxEnt         Oracle
br-fr    LER (%)     [44.2, 50.5]   [44.3, 50.0]   [40.8, 46.9]   [0.0, 0.0]
         BLEU (%)    [15.4, 17.3]   [14.5, 16.3]   [14.8, 16.6]   [16.7, 18.6]
mk-en    LER (%)     [26.8, 30.5]   [23.4, 27.0]   [25.2, 28.8]   [0.0, 0.0]
         BLEU (%)    [30.7, 32.3]   [29.1, 31.5]   [29.1, 31.5]   [30.9, 33.3]
eu-es    LER (%)     [38.8, 44.2]   [38.5, 44.1]   [40.9, 46.2]   [0.0, 0.0]
         BLEU (%)    [10.6, 12.6]   [10.2, 12.1]   [10.3, 12.2]   [11.5, 13.5]
en-es    LER (%)     [15.1, 18.9]   [10.3, 13.8]   [10.4, 13.8]   [0.0, 0.0]
         BLEU (%)    [21.9, 23.8]   [22.2, 24.1]   [22.2, 24.1]   [22.8, 24.7]

Table 5.3: LER and BLEU scores with 95% confidence intervals for the reference systems on the test corpora. The max-ent system has been trained using fractional counts. The results in bold face show statistically significant improvements for the maximum-entropy model compared to the best-coverage rules according to pair-bootstrap resampling.


Chapter 6

Conclusions

6.1 Summary

The aim of this thesis has been to improve the translation quality of shallow-transfer rule-based machine translation by including a module for context-based lexical selection. The main contributions are as follows:

• A rule formalism for writing lexical selection rules based on source-language local context.

• An efficient implementation of this formalism using finite-state transducers, with an algorithm to calculate the best rule coverage for an input sentence.

• A generic learning approach for lexical-selection rules, with both a supervised learning method and an unsupervised learning method.

• A maximum-entropy approach to learning weights for the rules.

Regarding the formalism for writing lexical-selection rules and its implementation in a finite-state lexical-selection module, we have shown that it is possible for humans, given little time, to write rules which have a positive influence on translation performance for three diverse language pairs (Breton–French, Macedonian–English and Basque–Spanish). While the rules for the English–Spanish pair did not improve performance, neither did they decrease it. The inclusion of such a lexical-selection module allows users of a rule-based MT system to easily fix lexical-selection errors manually and to easily adapt a system to a new domain.

In addition to showing that rules may be written manually, we have also shown that it is possible to learn rules from parallel corpora. Rules learnt using this method provide substantial improvements in translation quality as regards lexical selection over all four language pairs considered in this thesis, even with the smallest amount of parallel text: just over two thousand sentences for the Breton–French pair.


The context rules learnt outperform choosing the most-often-aligned translation in all language pairs, but the improvement is only statistically significant (p = 0.95) in three out of the four language pairs (Breton–French, Basque–Spanish and English–Spanish).

As parallel corpora are not available for many language pairs, we also present a novel, unsupervised method for learning lexical-selection rules. The training method works by translating all possible combinations into the target language using the rest of the modules of the MT system, and then using the normalised probabilities from a language model to replace the alignment counts in the supervised learning method. In the best case (English–Spanish), the method gives a statistically significant improvement over simply picking online the best combination for each sentence according to the score returned by the target-language model. For the Macedonian–English and Basque–Spanish pairs, there is an improvement which is not significant at p = 0.95, and in the worst case, Breton–French, although the method does not outperform the target-language model, it allows us to get close to the same performance using only source-language, monolingual information. That it is possible to exceed the performance of the target-language model using only source-language information is a major contribution of this thesis.

The combination of the best-coverage algorithm and the rule-learning method in Chapter 4 has two disadvantages. The first is that in order to include a rule, the translation it selects must appear a given number of times more frequently than the default translation. In three out of four language pairs (Breton–French, Basque–Spanish and English–Spanish) this leads to discarding rules which pick an alternative translation that is less than three times as frequent as the default translation. The second disadvantage is that when applying the rules to an input sentence, shorter rules which may be more reliable can be discarded in favour of longer, less reliable rules. To overcome these disadvantages, in Chapter 5 we present a weighted version of the lexical-selection module, which uses the principle of maximum entropy to learn weights for rules of the same type as described in Chapter 4. Using weighted rules (equivalent to weighted features in maximum-entropy terminology) with supervised training on a parallel corpus gives an improvement for all language pairs, although the improvement is only statistically significant (p = 0.95) in three out of four systems (Breton–French, Macedonian–English and English–Spanish). As in Chapter 4, we also adapt the supervised training to be unsupervised in the same way, using fractional counts. This only provides an improvement for the system with the smallest amount of training data, the Breton–French one. However, the maximum-entropy training has the advantage that it allows us to dispense with the annotated development corpus which was necessary to calculate the rule-inclusion threshold.


All of the lexical-selection rules in this thesis can co-exist in the same module. Thus, it is possible for hand-written rules to be used at the same time as rules learnt from a corpus. We have shown improved lexical-selection performance over four language pairs with different typologies and different amounts of resources.

In Forcada et al. (2011), we stated that “no successful, efficient, general-purpose lexical selection module has been implemented yet [for Apertium]”. The work in this thesis, and the software released with it, means that this statement no longer holds. All of the software in this thesis is released as free/open-source software under the terms of the GNU General Public Licence¹, which ensures that the experiments are reproducible and allows other researchers to improve on them without having to reimplement the algorithms from scratch.

¹Version 3.0: http://www.gnu.org/licenses/gpl.html

6.2 Future work

There are a number of possible future lines of research that could be explored based on the work done in this thesis:

1. In Chapter 3, one of the frequent complaints from volunteers writing the rules was that it was not possible to write rules which took into account a non-fixed context, that is, rules which look for a given pattern at any position in the input sentence. Implementing this would involve looking at strategies for delimiting clause boundaries without having an explicit syntactic parse.

2. If we assume that a set of rules provides a set of source-language contexts which distinguish between different translations, it should be possible to take the same contexts and apply them to a different target language. For example, given a set of rules which distinguishes between the different translations of the Spanish word estación in French (gare, station, saison), it should be possible to apply the same contexts to distinguish between (geltoki, estazio, urtaro) in Basque. One possible approach would be to use a bilingual dictionary between the two target languages, but the question would remain of how to deal with the inter-target-language ambiguity. This would also open the possibility of using a large parallel corpus for Spanish–English to determine the contexts of ambiguous words in Spanish before generating rules for other language pairs with Spanish which do not have a parallel corpus.

3. Although we have looked at four distinct language pairs with diverse morphological typologies, there is no pair where both languages are morphologically complex. Any method described as language-independent should provide similar improvements for any given language regardless of differences in typology.




4. In this thesis we have only treated open-category words: nouns, adjectives and verbs. However, the methods described should be applicable to other categories, such as prepositions.

5. For the unsupervised monolingual learning method described in Chapter 4, there is currently no method of determining the rule-inclusion threshold without an annotated development corpus. While the use of the combined probability looked promising, the evaluation showed that using the rule set that maximised probability on the development corpus actually led to a decrease in performance on the test corpus. Investigating MT quality-estimation metrics which do not rely on parallel corpora (such as Specia et al. (2009)) may provide an option for finding the threshold without relying on an annotated corpus. However, many estimation metrics take into account features which would remain static for lexical selection, such as sentence length, number of translations, number of content words, POS models, etc., so further investigation would be required.

6. The weights generated using supervised training for the maximum-entropy method could be adjusted using minimum-error-rate training (MERT, Och (2003)) using either the LER or the BLEU metric on the development corpus.

7. One disadvantage of the unsupervised training for the maximum-entropy method is that the number of features increases substantially as the number of alternative translations in the bilingual dictionary increases. Some of these alternative translations have such low weights that they will never be picked. In order to improve the performance of the system, these features could be pruned (this would be related to work such as Johnson et al. (2007), where a phrase table is pruned of statistically insignificant translation segments). Another possibility would be to discard during training those disambiguation paths which fall outside a given threshold of the probability mass.

8. The monolingual learning method described in Chapter 4 could be integrated with the unsupervised learning of part-of-speech taggers described by Sánchez-Martínez et al. (2008) to learn the part-of-speech tagger and the lexical-selection module at the same time. We can expect the part-of-speech tagger to benefit from alternative translation paths, which may lead to a more probable translation, and the lexical-selection module to benefit from alternative disambiguation paths leading to more fluent target-language output.


9. As we saw in the introduction, morphological, syntactic and semantic information can be useful for lexical selection. However, this thesis has focussed on lexical information, with some morphological information. Incorporating other types of information into the unweighted rule sets would be difficult, as the rule-inclusion threshold is unlikely to be able to distinguish between rules that improve and rules that worsen lexical-selection accuracy. However, the information could be included in the maximum-entropy method.

10. It remains to be seen to what extent the rules are domain-specific. It is likely that some are, and future work could involve trying to distinguish domain-specific rules from general ones. For example, a rule which picks ‘season’ as a translation of estació in the context of estació llarga might be a good rule for a general text, or for a text discussing meteorology, but when translating a domain-specific text on railway stations it may not be adequate. However, choosing banco as a translation of ‘bank’ in the context investment bank is likely to be a good rule regardless of the translation domain.

11. In the unsupervised training it would also be interesting to try different target-language models. In this thesis we have used a 5-gram model of surface forms — as in preliminary experiments it performed best — but it would also be interesting to try other language models with more or less structure.


Appendix A

Apertium: free/open-source shallow-transfer MT

A.1 Introduction

Apertium (Forcada et al., 2011) is a free/open-source platform for creating shallow-transfer RBMT systems.¹ The platform is widely used to build MT systems for a variety of language pairs, especially in those cases (mainly with related-language pairs) where shallow transfer suffices to produce good-quality translations. It has, however, also proven useful in assimilation scenarios involving more distant pairs. As of November 2012, a total of 33 language pairs have been released using the platform, and several more are currently under development.

The platform is designed to be: fast, translating in the order of thousands of words per second on a normal desktop computer; easy to develop for; and standalone, requiring no existing data or large parallel corpora to build a system.

The MT engine and tools in Apertium were not built from scratch, but are rather the result of a complete rewriting and extension of two previous MT systems, namely the Spanish–Catalan MT system interNOSTRUM.com (Canals-Marote et al., 2001) and the Spanish–Portuguese MT system traductor.universia.net (Garrido-Alenda et al., 2004), both developed by the Transducens group at Universitat d'Alacant. The first version of the whole system (Apertium level 1) was released on July 29, 2005, and closely followed the architecture of those two non-free systems. An enhanced version of the engine (Apertium level 2) was released on December 22, 2006, featuring an extended implementation of the structural transfer of Apertium level 1 to generalise more complex transformations for the translation between less-related language pairs.

¹This appendix is largely based on a paper by Forcada et al. (2011).


[Figure A.1 diagram: SL text → deformatter → morph. analyser → POS tagger → lexical transfer → lexical selection → structural transfer → morph. generator → post-generator → reformatter → TL text]
Figure A.1: The Apertium architecture. The lexical transfer module (shadowed) has been moved from being called from the structural transfer module to being a module in its own right (in bold face), and the lexical selection module has been inserted between lexical transfer and structural transfer.

A.2 Translation pipeline

A.2.1 Deformatter

The deformatter encapsulates format information (such as HTML, XML or RTF) in the input as superblanks, which will then be seen as blanks between words by the rest of the modules. These superblanks are started with a left bracket [ and ended with a right bracket ]. For example, the HTML text:

Arrain txikiak <i>lo egiten</i> dute.

would be processed by the deformatter as follows:

Arrain txikiak[ <i>]lo egiten[<\/i> ]dute.

Note that any reserved symbols ($, /, etc.) are also escaped by the deformatter.
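A rough sketch of the deformatting idea follows (this is not the actual Apertium implementation, and the reserved-symbol set shown is illustrative): inline tags are wrapped in brackets as superblanks and reserved characters are escaped.

    import re

    RESERVED = "^$/[]{}\\@<>"  # illustrative set of reserved symbols

    def escape(chunk):
        # Backslash-escape reserved symbols in ordinary text.
        return "".join("\\" + ch if ch in RESERVED else ch for ch in chunk)

    def deformat(text):
        # Wrap inline markup (and surrounding spaces) as [ superblanks ],
        # escaping the slash of closing tags as in the example above.
        out, pos = [], 0
        for m in re.finditer(r"\s*<[^>]+>\s*", text):
            out.append(escape(text[pos:m.start()]))
            out.append("[" + m.group(0).replace("/", "\\/") + "]")
            pos = m.end()
        out.append(escape(text[pos:]))
        return "".join(out)

    print(deformat("Arrain txikiak <i>lo egiten</i> dute."))
    # Arrain txikiak[ <i>]lo egiten[<\/i> ]dute.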

A.2.2 Morphological analyser

The morphological analyser segments the text into surface forms (SFs) (words or, where detected, multi-word lexical units, MWLUs) and delivers, for each of them, one or more lexical forms (LFs) consisting of lemma, lexical category and morphological information. It reads a finite-state transducer (FST) compiled from a source-language (SL) morphological dictionary in XML. Tokenisation is not a trivial task, due to morphological and orthographical phenomena like clitics, contractions, multiword units and compounding (a special case of multiword units). The strategy for dealing with these problems in Apertium is to process the input left-to-right, longest-match (described in Garrido-Alenda et al. (2002)). Processing the example in the previous section, the analyser would return:


^Arrain/Arrain<n>$^txikiak/txiki<adj><izo>+a<det><art><pl>

/txiki<adj><izo>+a<det><art><sg>+k<post>$[ <i>]^lo egiten/lo egin<vblex><ger>$[<\/i> ]^dute/ukan<vbsint><pri><NR_HU><NK_HK>$^./.<sent>$[]

Note that two multiwords have been detected, lo egin ‘to sleep’ and ari izan ‘to be doing x’. An ambiguity has also been detected between -ak as singular, ergative k<post>, and -ak as plural absolutive. The characters ^ and $ delimit the analyses for each surface form; lexical forms for each surface form are separated by /; angle brackets <…> are used to delimit tags (grammatical symbols). The string between ^ and the first / is the surface form.
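The left-to-right, longest-match strategy can be sketched as follows (a toy lexicon of surface forms stands in for the FST; this is an illustration, not the Apertium code):

    def tokenise(words, lexicon):
        # At each position take the longest sequence of words found in the
        # lexicon (e.g. the multiword 'lo egiten'), falling back to a
        # single word when nothing longer matches.
        i, units = 0, []
        while i < len(words):
            for j in range(len(words), i, -1):          # try longest first
                candidate = " ".join(words[i:j])
                if candidate in lexicon or j == i + 1:  # single-word fallback
                    units.append(candidate)
                    i = j
                    break
        return units

    lexicon = {"lo egiten", "arrain", "txikiak", "dute"}  # toy lexicon
    print(tokenise("Arrain txikiak lo egiten dute .".split(), lexicon))
    # ['Arrain', 'txikiak', 'lo egiten', 'dute', '.']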

A.2.3 Morphological disambiguation

For surface forms which return more than one possible lexical form, morphological disambiguation is necessary. This is accomplished with a part-of-speech tagger based on a first-order hidden Markov model (HMM; Cutting et al. (1992)). This tagger is a statistical module which can be trained in three ways: supervised, using maximum likelihood on a tagged corpus; unsupervised, using the Baum-Welch algorithm; and unsupervised, using information from the target language (Sánchez-Martínez et al., 2008). For our previous example the tagger would return:²

^Arrain<n>$^txiki<adj><izo>+a<det><art><pl>$[ <i>]^lo egin<vblex><ger>$[<\/i> ]^ukan<vbsint><pri><NR_HU><NK_HK>$^.<sent>$[]

Another option for morphological disambiguation, which is used in some language pairs (in this thesis, the apertium-br-fr and apertium-mk-en pairs), is constraint grammar (Karlsson et al., 1995). The integration of constraint grammar, particularly the free/open-source VISL variant³, in Apertium is described in Tyers (2009a).
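For illustration, a minimal Viterbi decoder for a first-order HMM is sketched below (the states and probabilities are toy values, not the Apertium tagger's actual tagset or parameters):

    def viterbi(observations, states, start_p, trans_p, emit_p):
        # Most likely tag sequence for a sentence under a first-order HMM.
        V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
        for obs in observations[1:]:
            row = {}
            for s in states:
                prob, path = max(
                    (V[-1][p][0] * trans_p[p][s] * emit_p[s][obs],
                     V[-1][p][1] + [s])
                    for p in states)
                row[s] = (prob, path)
            V.append(row)
        return max(V[-1].values())[1]

    # Toy example loosely based on the -ak plural/ergative ambiguity:
    states = ("pl", "erg")
    start_p = {"pl": 0.6, "erg": 0.4}
    trans_p = {"pl": {"pl": 0.7, "erg": 0.3}, "erg": {"pl": 0.4, "erg": 0.6}}
    emit_p = {"pl": {"txikiak": 0.5, "dute": 0.5},
              "erg": {"txikiak": 0.5, "dute": 0.5}}
    print(viterbi(["txikiak", "dute"], states, start_p, trans_p, emit_p))
    # ['pl', 'pl']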

A.2.4 Pretransfer

The task of the pretransfer module is to split clitics into their constituent parts before they are passed to the lexical transfer module.

²The sequence returned by the current tagger in the pair apertium-eu-es is: <n> <adj><izo>+<det><art><sg>+<post> <vblex><ger> <vbsint><pri><NR_HK> <sent>; it has been altered for the example to make it more adequate.

³http://beta.visl.sdu.dk/constraint_grammar.html


^Arrain<n>$^txiki<adj><izo>$ ^a<det><art><pl>$[ <i>]^lo egin<vblex><ger>$[<\/i> ]^ukan<vbsint><pri><NR_HU><NK_HK>$^.<sent>$[]

A.2.5 Lexical transfer

The lexical transfer module reads each SL LF and delivers the corresponding target-language (TL) LF by looking it up in a bilingual dictionary encoded as an FST compiled from the corresponding XML file. For each SL LF, zero or more TL LFs are returned.

^Arrain<n>/pescado<n><m><ND>/pez<n><m><ND>$^txiki<adj><izo>/pequeño<adj><GD><ND><izo>$^a<det><art><pl>/el<det><def><GD><pl>$[ <i>]^lo egin<vblex><ger>/dormir<vblex><ger>$[<\/i> ]^ukan<vbsint><pri><NR_HU><NK_HK>/tener<vblex><pri><NR_HU><NK_HK>$^.<sent>/.<sent>$[]

A.2.6 Lexical selection

The lexical selection module (as described in this thesis; see Chapter 3) reads the output of the lexical transfer module and applies a sequence of one or more finite-state constraint rules in order to choose the most adequate translation in context. In the maximum-entropy approach these rules may have weights, which are combined to choose the highest-scoring translation.

^Arrain<n>/pez<n><m><ND>$^txiki<adj><izo>/pequeño<adj><GD><ND><izo>$^a<det><art><pl>/el<det><def><GD><pl>$[ <i>]^lo egin<vblex><ger>/dormir<vblex><ger>$[<\/i> ]^ukan<vbsint><pri><NR_HU><NK_HK>/tener<vblex><pri><NR_HU><NK_HK>$^.<sent>/.<sent>$[]

This example presents the result of applying a hand-written rule. The rule chooses the translation pez (‘fish’, the animal) if the verb lo egin ‘sleep’ is found two words to the right.

A.2.7 Structural transfer

Since Apertium 2, the structural transfer module consists of three sub-modules; all of the pairs used in this thesis make use of all three. The first module is a chunker, which performs local syntactic operations and segments the sequence of lexical units into chunks delimited by braces { and }.


A chunk is defined as a fixed-length sequence of lexical categories that corresponds to some syntactic feature such as a noun phrase or a prepositional phrase.

^Pr_det_nom_adj<SN><art><m><pl>{^el<det><2><m><pl>$^pez<n><m><pl>$ ^pequeño<adj><m><pl>$}$

[ <i>]^vbconj<SV><vblex><p3><pl><NR_HU><NK_HK>{^dormir<vblex><pri><p3><pl>$}$[<\/i> ]

^punt<sent>{^.<sent>$}$[]

Following this is the interchunk module, which performs longer-range operations on and between the chunks. More than one interchunk module can be used in sequence to perform increasingly higher-level transfer transformations.

^Pr_det_nom_adj<SN><art><m><pl>{^el<det><2><m><pl>$^pez<n><m><pl>$ ^pequeño<adj><m><pl>$}$

[ <i>]^vbconj<SV><vblex><p3><pl><NR_HU><NK_HK>{^dormir<vblex><pri><p3><pl>$}$[<\/i> ]

^punt<sent>{^.<sent>$}$[]

And finally, a postchunk module performs finishing operations on each chunk and removes the chunk encapsulation, so that a plain sequence of LFs is generated, suitable for morphological generation.

^El<det><def><m><pl>$ ^pez<n><m><pl>$^pequeño<adj><m><pl>$[ <i>]^dormir<vblex><pri><p3><pl>$[<\/i> ]^.<sent>$[]

A.2.8 Morphological generator

The morphological generator delivers a TL SF for each TL LF by suitably inflecting it. It reads an FST compiled from a TL morphological dictionary in XML.

Los peces pequeños[ <i>]duermen[<\/i> ].[]

A.2.9 Post-generator

The post-generator performs orthographic operations, such as contractions (e.g. Spanish a + el = al or Portuguese por + as = pelas), apostrophation (e.g. Catalan el + institut = l'institut) or epenthesis (e.g. English a + institute = an institute). This is done using an FST generated from a rule file written in XML.
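The kind of orthographic rule involved can be sketched as plain string replacements (a toy illustration; the real module compiles such rules from an XML file into an FST, and the rules below are deliberately simplified):

    import re

    # Toy post-generation rules as (pattern, replacement) pairs.
    RULES = [
        (r"\ba el\b", "al"),          # Spanish contraction: a + el -> al
        (r"\bel (?=[aeiou])", "l'"),  # Catalan apostrophation (simplified)
        (r"\ba (?=[aeiou])", "an "),  # English epenthesis (simplified)
    ]

    def postgenerate(text):
        for pattern, replacement in RULES:
            text = re.sub(pattern, replacement, text)
        return text

    print(postgenerate("Voy a el mercado"))  # Voy al mercado
    print(postgenerate("a institute"))       # an institute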


A.2.10 Reformatter

The reformatter performs the opposite task to the deformatter, stripping out the brackets and unescaping reserved symbols.

Los peces pequeños <i>duermen</i> .


Appendix B

Software released as part of this thesis

B.1 apertium-lex-tools

All the software described in this thesis is implemented in the apertium-lex-tools package, which has been released under the GNU GPL licence, version 2.0 or later; it can be downloaded from http://sf.net/projects/apertium. The package provides both the processor, lrx-proc, and the compiler, lrx-comp, of the rule formalism described in Chapter 3. It also provides a set of tools to generate lexical-selection rules from parallel and monolingual corpora. Generating rules from parallel corpora depends on the free/open-source GIZA++ package (Och and Ney, 2003) to compute word alignments. For generating rules from monolingual corpora, the free/open-source IRSTLM toolkit (Federico et al., 2008) is used.

A compilable version of the YASMET¹ maximum-entropy toolkit is also included for the convenience of the user.

¹http://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html


List of Figures

1 Summary of the lexical-selection error-rate results for the English–Spanish (top) and Basque–Spanish (bottom) pairs. For an explanation of Ling, see Chapter 2. For details of TLM (target-language model), MLT (most-likely translation), MRul (unsupervised rules), MOAT (most-often-aligned translation) and PRul (supervised rules), see Chapter 4. For details of ME-M (unsupervised rules with weights) and ME-P (supervised rules with weights), see Chapter 5.

2 Summary of the lexical-selection error-rate results for the Macedonian–English (top) and Breton–French (bottom) pairs. For an explanation of Ling, see Chapter 2. For details of TLM (target-language model), MLT (most-likely translation), MRul (unsupervised rules), MOAT (most-often-aligned translation) and PRul (supervised rules), see Chapter 4. For details of ME-M (unsupervised rules with weights) and ME-P (supervised rules with weights), see Chapter 5.

1.1 The Vauquois pyramid (Vauquois, 1968) shows the different levels of abstraction of the intermediate representation in rule-based machine translation. At the bottom of the pyramid is direct machine translation, and at the top interlingual machine translation; between these two lie varying levels of transfer-based machine translation.

1.2 Example of how a typical transfer-based MT system works. The source text is first converted into a source-language intermediate representation (SL IR), which is then converted by the transfer module into the target-language intermediate representation (TL IR); finally, the target-language generation module generates the target text from this representation.


2.1 The input sentence and the three sets of translations used for calculating the lexical-selection error rate. The source sentence S = (s1, s2, ..., s|S|) has one ambiguous word, estació. There is one difference between the reference set Tr(si) and the test set Tt(si) of translations; thus the error rate for this sentence is 100%.

3.1 The rule in Table 3.1 expressed in XML format. Note that the empty <match/> tag matches any lexical unit in the source language and performs the skip operation.

3.2 An example of lexical-selection rules for the Catalan–English pair in the XML format. The rules were written by hand to choose alternative translations of the ambiguous word estació ‘station (default), season, resort’. The order of the rules is not important for their application.

3.3 A finite-state transducer representing some of the lexical-selection rules described in Figure 3.2. The representation has been simplified by replacing each series of letter transitions with a single word transition. The numerals before the final state are the rule identifiers, used for tracing rule application. . . . 29
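
What the word-level transducer computes can be sketched naively as pattern matching over word sequences (the rule contexts below are illustrative, not the actual rules of the thesis):

# Naive equivalent of the transducer in Figure 3.3: each rule is a
# sequence of source words plus the translation it selects, identified
# by a numeral as in the figure. Contexts here are invented examples.

RULES = {
    1: (("estació", "de", "tren"), ("estació", "station")),
    2: (("estació", "d'esquí"), ("estació", "resort")),
}

def matching_rules(sentence):
    hits = []
    for start in range(len(sentence)):
        for rule_id, (pattern, selection) in RULES.items():
            if tuple(sentence[start:start + len(pattern)]) == pattern:
                hits.append((rule_id, selection))
    return hits

print(matching_rules(["una", "estació", "de", "tren"]))
# -> [(1, ('estació', 'station'))]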

3.4 A hand-written rule selecting suponer as a translation of eman 'give, suppose, …'. The rule has six wildcards to allow an intervening subordinate clause of six words. In the Basque–Spanish rules there were seven rules of this type for the same selection, each with a different number of <match/> tags. The example sentence may be translated as 'Let us suppose that A is a point in space.', where 'that' is translated by -la and the wildcards match the intervening clause. . . . 32

3.5 The rule from Figure 3.1 rewritten in the Constraint Grammar (CG) formalism. The angle brackets are used to identify the lemmas in the source language. The numerals indicate relative positions. . . . 35

4.1 Example of the output of the lexical-transfer module of our Basque–Spanish machine translation system for two SL sentences from our parallel corpus in Table 4.2. Before being passed to the lexical-transfer module, the input is tokenised, lemmatised and tagged for part of speech. The presence of a hyphen in front of a lemma (for example -ko) indicates that it is a clitic, which in the orthography is attached to the previous word. Where a word is underlined, it indicates that this is an appropriate selection in the given context. . . . 42


4.2 Graphs showing the evolution of the lexical-selection error rate for the English–Spanish pair on the development corpus as the rule-inclusion threshold increases. The number of possible values of θ is small enough (fewer than 400) to carry out an exhaustive search for the optimal value. In this case, the optimal value of θ is between 2.5 and 2.8, yielding a LER of 7.1%, compared to 8.2% using the most-often-aligned translation. The second graph is a magnification of the first part of the first graph. . . . 46

4.3 Graph showing the evolution of the lexical-selection error rate for the English–Spanish pair on the held-out development corpus as the rule-inclusion threshold decreases. The number of possible values of θ is too high to carry out an exhaustive search for the optimal value, so we sample in steps of 2^(1/8). For this pair, the optimal value of θ is between 200,000 and 600,000, yielding a LER of 10.8% compared to 11.0% using the most-likely translation. In this case the improvement of using rules over using the most-likely translation is not significant at p = 0.95. . . . 51
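
The geometric sampling of θ can be sketched as follows (the bounds are illustrative); each candidate would be evaluated on the development corpus and the value minimising the LER kept:

# Sketch of sampling candidate thresholds in geometric steps of
# 2^(1/8), as used when exhaustive search over θ is infeasible.
low, high = 1.0, 1_000_000.0
step = 2 ** (1 / 8)
theta = low
candidates = []
while theta <= high:
    candidates.append(theta)
    theta *= step
print(len(candidates))  # roughly 8 * log2(high/low) + 1 values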

4.4 English–Spanish. The first graph shows how the error rate decreases as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the best target-language-model result. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-often-aligned translation. . . . 55

4.5 Basque–Spanish. The first graph shows how the error rate decreases as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the best target-language-model result. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-often-aligned translation. . . . 56


4.6 Macedonian–English. The first graph shows how the error rate decreases as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the best target-language-model result. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-often-aligned translation. . . . 57

4.7 Breton–French. The first graph shows how the error rate decreases as more parallel data is used for training. The two curves show the effect of using only rules based on the most-often-aligned translation, and of using the same rules in combination with rules incorporating context. The two horizontal lines are the linguistic defaults and the best target-language-model result. The error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-often-aligned translation. . . . 58

4.8 English–Spanish. The first graph shows the error rate decreasing as more monolingual training data is added. The two curves show the effect of using only rules that select the most-likely translation calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-likely translation. . . . 59

4.9 Basque–Spanish. The first graph shows the error rate decreasing as more monolingual training data is added. The two curves show the effect of using only rules that select the most-likely translation calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-likely translation. . . . 60


4.10 Macedonian–English. The first graph shows the error rate decreasing as more monolingual training data is added. The two curves show the effect of using only rules that select the most-likely translation calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-likely translation. . . . 61

4.11 Breton–French. The first graph shows the error rate decreasing as more monolingual training data is added. The two curves show the effect of using only rules that select the most-likely translation calculated from the fractional counts from the TLM, and of using the same rules in combination with rules incorporating context. The two horizontal lines are two of the reference results. Error bars show the 95% confidence intervals. The second graph shows the coverage of the context rules and of the most-likely translation. . . . 62

4.12 Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the English–Spanish pair. This can be seen as a summary of Figure 4.4. The error bars show the 0.95 confidence intervals. . . . 63

4.13 Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the Basque–Spanish pair. This can be seen as a summary of Figure 4.5. The error bars show the 0.95 confidence intervals. . . . 64

4.14 Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the Macedonian–English pair. This can be seen as a summary of Figure 4.6. The error bars show the 0.95 confidence intervals. . . . 65

4.15 Lexical-selection error rate (top) and BLEU results (bottom) for the parallel-corpus learning method, along with reference results for the Breton–French pair. This can be seen as a summary of Figure 4.7. The error bars show the 0.95 confidence intervals. . . . 66

4.16 Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the English–Spanish pair. This can be seen as a summary of Figure 4.8. The error bars show the 0.95 confidence intervals. Note the difference between the TLM and TLM-Rules systems. . . . 67


4.17 Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the Basque–Spanish pair. This can be seen as a summary of Figure 4.9. The error bars show the 0.95 confidence intervals. . . . 68

4.18 Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the Macedonian–English pair. This can be seen as a summary of Figure 4.10. The error bars show the 0.95 confidence intervals. . . . 69

4.19 Lexical-selection error rate (top) and BLEU results (bottom) for the monolingual-corpus learning method, along with reference results for the Breton–French pair. This can be seen as a summary of Figure 4.11. The error bars show the 0.95 confidence intervals. . . . 70

5.1 The maximum-entropy rule-application process. The features used are the contexts from the rules learnt in the previous chapter. The weights for each translation are summed and, when a position in the sentence is reached where q0 is the only alive state, the translation with the highest combined weight is picked. For the word arrain, the translation pez (0.8 + 0.9 + 1.7 + 1.8 = 5.2) is chosen over pescado (1.3 + 1.3 + 1.0 + 1.2 = 4.8), and for the word handi, the translation grande (1.2) is chosen over capaz (0.4). . . . 75
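
The weight combination described in the caption amounts to summing the matched weights per translation and taking the argmax; a minimal sketch, using the weights from the caption:

# Sketch of the weight combination in Figure 5.1: the weights of all
# rule contexts that matched each translation are summed, and the
# translation with the highest total is chosen.
matched = {
    "arrain": {"pez": [0.8, 0.9, 1.7, 1.8],        # total 5.2
               "pescado": [1.3, 1.3, 1.0, 1.2]},   # total 4.8
    "handi": {"grande": [1.2], "capaz": [0.4]},
}
for word, weights in matched.items():
    best = max(weights, key=lambda t: sum(weights[t]))
    print(word, "->", best)   # arrain -> pez, handi -> grande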

A.1 The Apertium architecture. The lexical-transfer module (shadowed) has been moved from being called from the structural-transfer module to being a module in its own right (in bold face), and the lexical-selection module has been inserted between lexical transfer and structural transfer. . . . 86


List of abbreviations

BLEU bilingual evaluation understudy

CBMT corpus-based machine translation

EBMT example-based machine translation

IR intermediate representation

LER lexical-selection error rate

MT machine translation

MaxEnt maximum entropy

PBSMT phrase-based statistical machine translation

RBMT rule-based machine translation

SL source language

SMT statistical machine translation

SVN Subversion version-control system

TL target language

WSD word-sense disambiguation

XML extensible markup language


Index of symbols

S  a sentence in the source language . . . 18
Ts(si)  function that returns all possible translations of word si according to the bilingual dictionary . . . 18
Tr(si)  function that returns the set of reference translations which are acceptable for si in sentence S . . . 18
Tt(si)  function that returns the set of translations selected by the lexical-selection module . . . 18
amb(si)  function that tests if word si is ambiguous according to the bilingual dictionary of the MT system . . . 19
diff(Tr(si), Tt(si))  function that tests if the word in the test sentence Tt(si) is in the set of reference translations Tr(si) . . . 19
x  a source-language pattern . . . 26
c  a context in the source language made up of a sequence of patterns . . . 26
y  an instruction . . . 26
d  a target-language pattern . . . 26
u  an operation (y, d) . . . 26
U  a sequence of operations u . . . 26
r  a lexical-selection rule (c, U) or (c, U, λ) . . . 26
R  set of rules . . . 27
Q  set of states in the rule transducer . . . 27
Σ  set of input symbols . . . 27
Γ  set of output symbols . . . 27
L  the alphabet of transition labels L = Σ × Γ . . . 27
δ  the transition function δ : Q × L → Q . . . 27
q0  the initial state . . . 27
qF  the final state . . . 27
A  the set of alive states . . . 27
M  a map containing the best coverage for a given input position . . . 27
ξ  ratio of the alternative translation to the default translation . . . 40
C  set of SL n-gram contexts . . . 40
V  set of ambiguous SL words for which a translation has been seen in the corpus . . . 40
θ  threshold . . . 40
ζ  the default translation . . . 40
v  a single word from V . . . 40
t  translation of word v in context c . . . 40
a  set of word alignments . . . 41
T  sequence of words in TL . . . 41
G  samples of (S, T, a) (supervised training) and (S, G) (unsupervised training) . . . 45
G  set of possible lexical-selection paths of sentence S . . . 45
g  sequence of lexical-selection choices of SL words . . . 45
τ(gi, S)  function that returns a complete translation of path gi of sentence S . . . 47
p(gi|S)  normalised probability of path gi of sentence S according to the TL model . . . 47
PTL  probability according to the TL model . . . 47
ps(t, c)  the probability of TL word t being the translation of SL word s in context c . . . 74
hs(t, c)  a feature which selects the translation t of word s in context c . . . 74
λs  a feature weight . . . 74
Z  normalising constant . . . 74
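
Written out, the last four symbols combine in the standard maximum-entropy form of Berger et al. (1996); the per-feature index k below is added here for presentation and is not part of the thesis notation:

% Maximum-entropy lexical-selection model assembled from the symbols above
p_s(t \mid c) = \frac{1}{Z} \exp\Bigl( \sum_{k} \lambda_{s,k}\, h_{s,k}(t, c) \Bigr),
\qquad
Z = \sum_{t'} \exp\Bigl( \sum_{k} \lambda_{s,k}\, h_{s,k}(t', c) \Bigr)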


Bibliography

Alegria, I., de Ilarraza, A. D., Labaka, G., Lersundi, M., Mayor, A., Sarasola, K., Forcada, M. L., Ortiz-Rojas, S., and Padró, L. (2005). An open architecture for transfer-based machine translation between Spanish and Basque. In Proceedings of Machine Translation Summit X, pages 7–14.

Arnold, D. (2003). Why translation is difficult for computers. In Somers, H., editor, Computers and Translation: A translator's guide. Benjamins Translation Library.

Berger, A., Della Pietra, S. A., and Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Bick, E. (2007). Dan2eng: wide-coverage Danish-English machine translation. In Proceedings of Machine Translation Summit XI, pages 37–43.

Brandt, M. D., Loftsson, H., Sigurþórsson, H., and Tyers, F. M. (2011). Apertium-IceNLP: a rule-based Icelandic to English machine translation system. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, EAMT11, pages 217–224.

Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2008). Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 70–106.

Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012). Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51.


Callison-Burch, C., Osborne, M., and Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006, pages 249–256.

Canals-Marote, R., Esteve-Guillen, A., Garrido-Alenda, A., Guardiola-Savall, M., Iturraspe-Bellver, A., Montserrat-Buendia, S., Ortiz-Rojas, S., Pastor-Pina, H., Perez-Antón, P., and Forcada, M. (2001). The Spanish-Catalan machine translation system interNOSTRUM. In Proceedings of MT Summit VIII: Machine Translation in the Information Age, pages 73–76, Santiago de Compostela, Spain.

Carbonell, J., Klein, S., Miller, D., Steinbaum, M., Grassiany, T., and Frei, J. (2006). Context-based machine translation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, "Visions for the Future of Machine Translation", pages 19–28.

Carl, M. and Way, A., editors (2003). Recent Advances in Example-Based Machine Translation, volume 21. Springer.

Carpuat, M. and Wu, D. (2005). Evaluating the word sense disambiguation performance of statistical machine translation. In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP), pages 122–127.

Carpuat, M. and Wu, D. (2007). Improving statistical machine translation using word sense disambiguation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 61–72.

Chan, Y. S. and Ng, H. T. (2007). Word sense disambiguation improves statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07), pages 33–40.

Chen, Y., Eisele, A., Federmann, C., Jellinghaus, M., and Theison, S. (2007). D 6.1: Improved confidence estimation and hybrid architectures for machine translation. Technical Report D 6.1, EuroMatrix. http://euromatrix.net/deliverables/deliverable61.pdf.

Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Coughlin, D. (2003). Correlating automated and human assessments of machine translation quality. In Proceedings of MT Summit IX, pages 23–27.


Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. (1992). A practical part-of-speech tagger. In Third Conference on Applied Natural Language Processing. Association for Computational Linguistics. Proceedings of the Conference, pages 133–140, Trento, Italy.

Dagan, I. and Itai, A. (1994). Word sense disambiguation using a second language monolingual corpus. Computational Linguistics, 20:563–596.

Dandapat, S., Forcada, M. L., Groves, D., Penkale, S., Tinsley, J., and Way, A. (2010). OpenMaTrEx: a free/open-source marker-driven example-based machine translation system. In Advances in Natural Language Processing: 7th International Conference on NLP, IceTAL 2010, pages 121–126.

Denkowski, M. and Lavie, A. (2012). Challenges in predicting machine translation utility for human post-editors. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas, pages 40–49.

Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 138–145, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.

Federico, M., Bertoldi, N., and Cettolo, M. (2008). IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, Brisbane, Australia, pages 1618–1621.

Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Sánchez-Martínez, F., Ramírez-Sánchez, G., and Tyers, F. M. (2011). Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2):127–144.

Gale, W. A., Church, K. W., and Yarowsky, D. (1991). One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, HLT '91, pages 233–237.

Garrido-Alenda, A., Forcada, M. L., and Carrasco, R. C. (2002). Incremental construction and maintenance of morphological analysers based on augmented letter transducers. In Proceedings of the 9th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 53–62, Keihanna, Japan.


Garrido-Alenda, A., Zarco, P. G., Pérez-Ortiz, J., Pertusa-Ibáñez, A., Ramírez-Sánchez, G., Sánchez-Martínez, F., Scalco, M., and Forcada, M. (2004). Shallow parsing for Portuguese-Spanish machine translation. In Branco, A., Mendes, A., and Ribeiro, R., editors, Language Technology for Portuguese: Shallow Processing Tools and Resources, pages 135–144. Lisboa.

Ginestí-Rosell, M., Ramírez-Sánchez, G., Ortiz-Rojas, S., Tyers, F. M., and Forcada, M. L. (2009). Development of a free Basque to Spanish machine translation system. Procesamiento de Lenguaje Natural, (43):185–197.

Han, C., Xia, F., Palmer, M., and Rosenzweig, J. (1996). Capturing language-specific constraints on lexical selection with feature-based lexicalised tree-adjoining grammars. In Proceedings of the International Conference on Chinese Computing '96, pages 1–9.

Her, O., Higinbotham, D., and Pentheroudakis, J. (1994). Lexical and idiomatic transfer in machine translation: an LFG approach. Research in Humanities Computing, 3:200–216.

Hutchins, W. J. and Somers, H. L. (1992). An Introduction to Machine Translation. Academic Press, London, UK.

Ide, N. and Véronis, J. (1998). Word sense disambiguation: the state of the art. Computational Linguistics, 24(1):1–41.

Jian, S., Jin, G., and Cheong, T. L. (1999). Target word selection with co-occurrence and translation information. In Proceedings of MT Summit VII, pages 412–416.

Johnson, H., Martin, J., Foster, G., and Kuhn, R. (2007). Improving translation quality by discarding most of the phrasetable. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 967–975.

Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. (1995). Constraint Grammar: A Language Independent System for Parsing Unrestricted Text. Mouton de Gruyter.

Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 388–395.

Koehn, P. (2005). Europarl: a parallel corpus for statistical machine translation. In Proceedings of the 10th MT Summit, pages 79–86.

Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press.

Koehn, P., Birch, A., and Steinberger, R. (2009). 462 machine translation systems for Europe. In Proceedings of MT Summit XII, pages 56–64.


Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session.

Koehn, P. and Knight, K. (2000). Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-00), pages 711–715.

Koehn, P. and Knight, K. (2001). Knowledge sources for word-level translation models. In Lee, L. and Harman, D., editors, Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 27–35.

Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of HLT-NAACL, pages 127–133.

Lavie, A. and Denkowski, M. J. (2009). The Meteor metric for automatic evaluation of machine translation. Machine Translation, 23(2-3):105–115.

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.

Melero, M., Oliver, A., Badia, T., and Suñol, T. (2007). Dealing with bilingual divergences in MT using target-language n-gram models. In Proceedings of the METIS-II Workshop: New Approaches to Machine Translation, CLIN17, pages 19–26.

Mitamura, T., Nyberg, E. H., and Carbonell, J. G. (1991). An efficient interlingua translation system for multi-lingual document production. In Proceedings of Machine Translation Summit III, pages 2–4.

Nagao, M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. Artificial and Human Intelligence, pages 173–180.

Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, July 2003, pages 160–167.

Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL-2002: 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.


Phillips, A. B. (2011). Cunei: open-source machine translation with relevance-based models of each translation instance. Machine Translation, 25(2):161–177.

Rangelov, T. and Tyers, F. M. (2011). Rule-based machine translation between Bulgarian and Macedonian. In Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, pages 53–61.

Roche, E. and Schabes, Y. (1997). Finite-State Language Processing. MIT Press.

Sánchez-Martínez, F., Forcada, M. L., and Way, A. (2009). Hybrid rule-based – example-based MT: feeding Apertium with sub-sentential translation units. In Proceedings of the 3rd Workshop on EBMT, pages 11–18.

Sánchez-Martínez, F., Pérez-Ortiz, J. A., and Forcada, M. L. (2008). Using target-language information to train part-of-speech taggers for machine translation. Machine Translation, 22(1-2):29–66. DOI: 10.1007/s10590-008-9044-3.

Sarrionandia, J. (1995). Marinel zaharraren balada. Pamiela.

Scannell, K. (2007). The Crúbadán Project: corpus building for under-resourced languages. In Proceedings of the 3rd Web as Corpus Workshop (WAC3).

Scott, B. and Barreiro, A. (2009). OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation (FREERBMT09), pages 19–26.

Specia, L., das Graças V. Nunes, M., and Stevenson, M. (2005a). Exploiting rules for word sense disambiguation in machine translation. Procesamiento del Lenguaje Natural, 35:171–178.

Specia, L., Oliveira-Neto, S., Nunes, M. G. V., and Stevenson, M. (2005b). An automatic approach to create a sense-tagged corpus for word sense disambiguation. In Proceedings of the 2nd Meaning Workshop, pages 31–36.

Specia, L., Wang, Z., Turchi, M., Shawe-Taylor, J., and Saunders, C. (2009). Improving the confidence of machine translation quality estimates. In Proceedings of Machine Translation Summit XII, pages 152–160.

Sánchez-Martínez, F., Pérez-Ortiz, J. A., and Forcada, M. L. (2007). Integrating corpus-based and rule-based approaches in an open-source machine translation system. In Proceedings of the METIS-II Workshop: New Approaches to Machine Translation, a workshop at CLIN 17 - Computational Linguistics in the Netherlands, pages 73–82.

Tiedemann, J. and Nygård, L. (2004). The OPUS corpus - parallel and free. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), pages 1183–1186.

Tyers, F. M. (2009a). Design and implementation of a Welsh–English machine translation system. Master's thesis, Universitat d'Alacant.

Tyers, F. M. (2009b). Rule-based augmentation of training data for Breton–French statistical machine translation. In Proceedings of the 13th Conference of the European Association for Machine Translation, pages 213–218.

Tyers, F. M. (2010). Rule-based Breton to French machine translation. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation, pages 174–181.

Tyers, F. M. and Alperen, M. S. (2010). SETimes: a parallel corpus of Balkan languages. In Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages at the Language Resources and Evaluation Conference, pages 1–5.

Tyers, F. M., Sánchez-Martínez, F., and Forcada, M. L. (2012). Flexible finite-state lexical selection for rule-based machine translation. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 213–220, Trento, Italy.

Vauquois, B. (1968). A survey of formal grammars and algorithms for recognition and transformation in mechanical translation. In IFIP Congress (2), pages 1114–1122.

Vickrey, D., Biewald, L., Teyssier, M., and Koller, D. (2005). Word-sense disambiguation for machine translation. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 771–778.

Wu, H. and Wang, H. (2007). Pivot language approach for phrase-based statistical machine translation. Machine Translation, 21(3):165–181.

Yang, J. (1999). Towards the automatic acquisition of lexical selection rules. In Proceedings of Machine Translation Summit VII, pages 397–403.

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivalling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.


Zens, R., Och, F. J., and Ney, H. (2002). Phrase-based statistical machine translation. In KI 2002: Advances in Artificial Intelligence, 25th Annual German Conference on AI, pages 18–32.

Zhang, Y. and Vogel, S. (2004). Measuring confidence intervals for the machine translation evaluation metrics. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 85–94.

Zinovjeva, N. (2000). Learning sense disambiguation rules for machine translation. Master's thesis, Uppsala University.