TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE Integration of Advanced Language Processing Techniques into Statistical Machine Translation 11:10-11:30 Wednesday, 17 October Diego Bartolome Tauyou


Post on 18-May-2015


DESCRIPTION

This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates, follow us on Twitter - #MosesCore

TRANSCRIPT

Page 1: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Seattle, Language Processing Techniques for Statistical Machine Translation, Diego Bartolome, tauyou, 17 October 2012

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE

Integration of Advanced Language Processing Techniques into Statistical Machine Translation

11:10-11:30, Wednesday, 17 October

Diego Bartolome, Tauyou

Page 2:

Language Processing Techniques

for

Statistical Machine Translation

Contact: Diego Bartolome – [email protected] / Les Planes 39, 1o 2a – 08201 Sabadell – Spain – Tel. +34 93 711 29 96

Page 3:

To start ...

Page 4:

… you choose Moses ...

Translation memories + linguistic assets

Cleaning and training following tutorials

BLEU score seems ok in training

… but ...

the results are awful!

Page 5:

Why?

Not enough data

Unclean translation memories

Misalignments

Spelling and grammar errors

Difficult language pairs

Selection of wrong parameters

Application of suboptimal techniques

So many things … what can you do?

Page 6:

Page 7:

Some steps

Maximum exploitation of existing assets

Source content optimization

Data selection and cleaning

Improvement of the models

Linguistic processing

...

Page 8:

Existing assets: increase TM leverage

Translation memory sharing

Clients, Partners, Competitors, EU, UN, TAUS

Relevant on-line data retrieval

Advanced TM techniques

Sub-segment matching

Part-of-speech replacement
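As an illustration of the TM techniques listed on this slide, here is a minimal sketch of word-level fuzzy matching and sub-segment matching using Python's standard `difflib`. The function names, the 0.7 threshold and the sample memory are assumptions made for the example; they are not taken from the presentation or from any particular TM tool.

```python
from difflib import SequenceMatcher


def best_tm_match(segment, memory, threshold=0.7):
    """Return (score, source, target) for the closest TM entry, or None.

    `memory` is a list of (source, target) pairs. The score is a word-level
    similarity ratio between 0 and 1; the threshold is an arbitrary choice.
    """
    best = None
    words = segment.lower().split()
    for source, target in memory:
        score = SequenceMatcher(None, words, source.lower().split()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


def shared_subsegment(segment, source):
    """Return the longest run of consecutive words common to both segments."""
    a, b = segment.split(), source.split()
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


memory = [
    ("Press the red button to start", "Pulse el botón rojo para empezar"),
    ("Close the cover", "Cierre la tapa"),
]
hit = best_tm_match("Press the green button to start", memory)
fragment = shared_subsegment("Press the green button to start", memory[0][0])
```

The sub-segment found this way could be looked up on its own when no full-segment match passes the threshold.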

Page 9:

Source optimization (I): Pre-editing

newdoc

proposeddoc + html

report

Spell check

Grammar check

Style check

Terminology check

Client checklist

Page 10:

Source optimization (II): Summarization

newdoc

proposeddoc + html

report

% to reduce

Use translation memories

Project

Client

All

Page 11:

Summarization example

http://www.translationautomation.com/press-releases/free-open-source-machine-translation-tutorial-is-made-available-by-taus

Page 12:

Data selection and cleaning – a sample

Clean translation memories

Length, punctuation, terminology, repetitions …

Segment splitting

Optimize weight of most frequent n-grams in corpus

Validate their translations

Add out-of-domain data for irrelevant n-grams
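The cleaning criteria above (length, punctuation, repetitions) could be sketched roughly as follows. This is only an illustration: the function name, the length-ratio limit of 2.0, the 80-word cap and the punctuation rule are assumptions, not the rules used in the presentation.

```python
def clean_pairs(pairs, max_ratio=2.0, max_words=80):
    """Filter (source, target) segment pairs from a translation memory.

    Drops empty segments, overly long segments, pairs whose word counts
    differ too much, pairs whose final punctuation disagrees, and repeats.
    All thresholds are illustrative assumptions.
    """
    end_marks = ".?!:;"
    seen = set()
    kept = []
    for source, target in pairs:
        # Normalize whitespace so that trivial variants count as repetitions.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue
        n_src, n_tgt = len(source.split()), len(target.split())
        if n_src > max_words or n_tgt > max_words:
            continue
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Punctuation check: both sides should end with a mark, or neither.
        if (source[-1] in end_marks) != (target[-1] in end_marks):
            continue
        key = (source.lower(), target.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((source, target))
    return kept


sample = [
    ("Open the door.", "Abra la puerta."),
    ("Open  the door.", "Abra la puerta."),                 # repetition
    ("Yes", "Sí, por supuesto que lo haremos mañana"),      # length mismatch
    ("", "vacío"),                                          # empty source
    ("Close the window.", "Cierre la ventana"),             # punctuation mismatch
]
cleaned = clean_pairs(sample)
```

A terminology check would need a client glossary and is left out of this sketch.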

Page 13:

Model optimization

Filter the translation tables

Remove the garbage + tune the weights if necessary

Optimize language models

Adapt them to the translation purpose

Tune parameters correctly

Tune set, test set, optimization parameters …

Improve recasing
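One possible way to "remove the garbage" from a translation table is to drop low-probability entries and keep only the best few options per source phrase. The sketch below assumes a text phrase table with `|||`-separated fields (source, target, scores) and assumes that the third score is the direct phrase translation probability; check this against your own table before relying on it. The function name and the thresholds are hypothetical.

```python
from collections import defaultdict


def prune_phrase_table(lines, top_n=20, min_prob=0.001, score_index=2):
    """Keep at most `top_n` entries per source phrase, ranked by one score.

    Assumed line layout: "source ||| target ||| s1 s2 s3 s4 [||| ...]".
    Entries whose chosen score is below `min_prob` are discarded.
    """
    by_source = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # malformed line
        try:
            scores = [float(value) for value in fields[2].split()]
        except ValueError:
            continue  # non-numeric score field
        if len(scores) <= score_index or scores[score_index] < min_prob:
            continue
        by_source[fields[0]].append((scores[score_index], line.rstrip("\n")))

    pruned = []
    for options in by_source.values():
        options.sort(key=lambda item: item[0], reverse=True)
        pruned.extend(entry for _, entry in options[:top_n])
    return pruned


table = [
    "the house ||| la casa ||| 0.8 0.7 0.9 0.6",
    "the house ||| el hogar ||| 0.1 0.2 0.05 0.1",
    "the house ||| casa la ||| 0.01 0.01 0.0001 0.01",
    "cat ||| gato ||| 0.9 0.9 0.95 0.9",
]
best_only = prune_phrase_table(table, top_n=1)
```

After pruning, the weights may need to be tuned again, as the slide notes.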

Page 14:

Linguistic processing

In the source and/or target language

Grammar checking

Entity detection

proper nouns, alphanumeric words, numbers, ...

Compound word splitting

Sentence reordering
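Entity detection for numbers and alphanumeric words can be approximated with a regular expression: tokens containing a digit are replaced by placeholders before translation and restored afterwards, so the engine cannot alter them. This is a sketch under assumptions; the `__ENTn__` placeholder format and the function names are invented for the example, and a real pipeline would need placeholders that its tokenizer leaves intact. Proper nouns are not covered.

```python
import re

# Word tokens containing at least one digit, with optional decimal part:
# "335102", "444mg", "3.5". Plain words such as "XXX" are not matched.
ENTITY = re.compile(r"\b\w*\d\w*(?:[.,]\d+)*\b")
PLACEHOLDER = re.compile(r"__ENT(\d+)__")


def mask_entities(text):
    """Replace digit-bearing tokens with numbered placeholders."""
    entities = []

    def replace(match):
        entities.append(match.group(0))
        return f"__ENT{len(entities) - 1}__"

    return ENTITY.sub(replace, text), entities


def unmask_entities(text, entities):
    """Put the original tokens back in place of the placeholders."""
    return PLACEHOLDER.sub(lambda m: entities[int(m.group(1))], text)


source = "XXX 335102 doses of 33, 88 and 444mg/kg/day at Week 4"
masked, found = mask_entities(source)
restored = unmask_entities(masked, found)
```

Because the placeholders are numbered, they can be restored correctly even if the translation reorders them.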

Page 15:

An example from

Source

XXX 335102 doses are calculated as a free acid of the sodium salt (NA).

The potential toxicity of XXX 335102 was studied in a number of acute toxicity studies in mouse and rat and repeat dose toxicity studies of 8 and 32 weeks each in rat and monkeys.

XXX 335102 was negative in a panel of in vivo and in vitro tests to assess mutagenicity and clastogenicity identifying no genotoxic risks for human subjects.

An in vitro assay for phototoxic potential suggested that XXX 335102 is photoxic/photosensitive.

In the 8-week studies in monkeys, increases in unconjugated bilirubin were noted at the doses tested (33, 88, 192 and 444mg/kg/day); the greatest increases occurring at Week 4 and declining or returning to control levels by Week 8.

Reference

Las dosis de XXX 335102 se calculan como la sal sódica sin ácido (AS).

La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.

Se obtuvieron resultados negativos en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano.

En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador.

En los estudios de 8 semanas en monos se apreció el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día), produciéndose el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.

Page 16:

Generic engine

XXX 335102 se calculan en forma de dosis de ácido libre del sodio sal (NA).

La Toxicidad potencial de XXX 335102 fue estudiado en una serie de estudios de toxicidad aguda en ratón y rata y vuelva a dosis estudios de toxicidad, de 8 y de 32 semanas en rata y cada uno de los monos.

XXX 335102 era negativo en un grupo de in vivo y pruebas in vitro para evaluar mutagenicidad y genotóxicas clastogenicity no identificar los riesgos para los participantes humanos.

Un para fines de ensayo in vitro phototoxic potencial se sugirió que XXX 335102 photoxic/Photosensitive.

En Los 8 -week estudios en los monos, aumentos en unconjugated bilirrubina salieron a las dosis analizada (33, 88, 192 y 444 mg/kg/día); los mayores incrementos habidos En la semana 4 y la reducción o devolver a nivel de control de 8 Por semana.

Medical engine with improvements

Las dosis XXX 335102 se calculan como ácido libre de la sal sódica (AS).

La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.

XXX 335102 dio negativo en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano.

En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador.

En los estudios de 8 semanas en monos, el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día); el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.

Page 17:

Conclusions

MT can be combined with other advanced techniques

Creating and improving an engine requires time

You can also be lucky at the first try!

Optimum results require translators

Implementation of the linguistic knowledge

Continuous improvement
