Transforming Parallel Corpora to Translation Memory
Steve Legrand
IPN
29th Sept. 2006
Parallel text or bitext
Aligned translation of text from one language to another.
Practical uses in NLP:
- Word sense disambiguation
- Automatic translation
- Translation memories
Translation Memory
Helps the translator by using previously translated text segments to suggest translations for new text segments.
Translation memory correspondence level can usually be set (e.g., 56%)
Automatic translation can be combined with translation memories: the automatic translation is post-edited and the result is then used in the translation memory.
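The correspondence level mentioned above can be pictured as a similarity threshold. The following is only a minimal sketch: it assumes character-level similarity from Python's `difflib` as a stand-in for whatever scoring a real translation memory tool uses, and the function name `best_match` and the sample segments are made up for illustration.

```python
from difflib import SequenceMatcher


def best_match(new_segment, memory, threshold=0.56):
    """Return (source, target, score) for the closest stored segment.

    Returns None when no stored source segment reaches the threshold.
    `memory` is a list of (source, target) pairs.
    """
    best = None
    for source, target in memory:
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common
        score = SequenceMatcher(None, new_segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


# Hypothetical memory content, for illustration only
memory = [
    ("The dog barked at the stranger.", "El perro ladró al desconocido."),
    ("He lit the fire.", "Encendió el fuego."),
]

hit = best_match("The dog barked at the strangers.", memory)
if hit is not None:
    print(f"Suggestion ({hit[2]:.0%} match): {hit[1]}")
```

Raising the threshold gives fewer but closer suggestions; lowering it gives more suggestions that need heavier editing.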
Translation memory format (.tmx)
TMX (Translation Memory eXchange, file extension .tmx) is a standardized XML format for interoperability between applications.
tu: translation unit, the parent element of each item to be translated. It can carry a unique identifier (tuid).
tuv: translation unit variant, the element that holds one language version of the unit and carries its language code (xml:lang).
seg: segment, the element that contains the text itself.
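How the three elements nest can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustrative sketch, not the output of any particular tool: the function name `build_tmx`, the header values, and the choice of TMX version 1.4 are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the xml:lang attribute
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Build a <tmx> element tree from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header values below are placeholders chosen for this sketch
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for i, (src, tgt) in enumerate(pairs):
        tu = ET.SubElement(body, "tu", tuid=str(i))      # one translation unit
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})  # one variant per language
            ET.SubElement(tuv, "seg").text = text             # the segment text
    return tmx


pairs = [
    ("CHAPTER I--THE TRAIL OF THE MEAT", "PRIMERA PARTE -- La pista de la carne"),
    ("He lit the fire.", "Encendió el fuego."),
]
xml_text = ET.tostring(build_tmx(pairs), encoding="unicode")
print(xml_text)
```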
TMX Example
Poor man’s guide to translation memories
Trados is the best known and probably one of the best commercial TM applications available.
Cheaper single-user versions exist, but even so the price is often prohibitive.
To avoid excessive costs, one could:
- Use a demo version of the commercial software
- Use Open Source products
OmegaT
- Open Source translation memory
- Needs the Java runtime
- Needs Open Office to convert the .doc format to the .odt or .sxw format (open standard)
- Creates .tmx files
- .tmx files exported from other applications can also be used
Parallel corpora → tmx
To use a parallel corpus as a translation memory, we first need to convert it to the tmx format.
We can either use an existing parallel corpus or create our own.
There are many open web resources for creating our own parallel corpora.
Using open parallel corpora resources – English source
Jack London published about 40 books in English. Almost all of his English-language works are publicly available at Project Gutenberg: http://www.gutenberg.org/wiki/Main_Page
Using open parallel corpora resources – Spanish source(s)
One of the many sources of Spanish translations of Jack London’s books is:
http://apuntes.rincondelvago.com/trabajos_global/literatura/
Aligning parallel texts
For example:
- Download “White Fang” by Jack London from Project Gutenberg and its translation “Colmillo Blanco” from rincondelvago
- Use bitext2tmx (free open source application) for alignment
bitext2tmx aligner: configuration
bitext2tmx aligner: text alignment
Bitext2tmx producing a tmx-file
<?xml version="1.0" encoding="ISO-8859-1"?>
<tmx version="1.1">
<header creationtool="Bitext2tmx" creationtoolversion="0.9" segtype="sentence" o-tmf="Bitext2tmx" adminlang="en" srclang="en" datatype="PlainText" o-encoding="ISO-8859-1"></header>
<body>
<tu tuid="0" datatype="Text">
  <tuv lang="en">
    <seg>CHAPTER I--THE TRAIL OF THE MEAT</seg>
  </tuv>
  <tuv lang="es">
    <seg>PRIMERA PARTE -- La pista de la carne</seg>
  </tuv>
</tu>
<tu tuid="1" datatype="Text">
  <tuv lang="en">
    <seg>Dark spruce forest frowned on either side the frozen waterway.</seg>
  </tuv>
  <tuv lang="es">
    <seg>Aun lado y a otro del helado cauce de erguía un oscuro bosque de abetos de ceñudo aspecto.</seg>
  </tuv>
</tu>
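A file of this shape can be read back into sentence pairs with a few lines of Python. The sketch below is an assumption-laden illustration: `read_pairs` is a made-up helper, the sample string is a shortened, closed-off stand-in for a real bitext2tmx file, and the code accepts both the plain `lang` attribute seen in this output and `xml:lang`.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_pairs(tmx_text, src="en", tgt="es"):
    """Return a list of (source, target) strings from TMX markup."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Accept either attribute spelling for the language code
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


# Shortened sample with closing tags added so that it parses
sample = """<tmx version="1.1">
<header creationtool="Bitext2tmx" srclang="en"></header>
<body>
<tu tuid="0"><tuv lang="en"><seg>CHAPTER I--THE TRAIL OF THE MEAT</seg></tuv>
<tuv lang="es"><seg>PRIMERA PARTE -- La pista de la carne</seg></tuv></tu>
</body>
</tmx>"""

for source, target in read_pairs(sample):
    print(source, "=>", target)
```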
The tmx-file produced by bitext2tmx can be added to OmegaT’s tm directory to be used as part of the translation memory
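Adding the file amounts to copying it into the project's tm directory. A minimal sketch follows, assuming the project folder contains (or may be given) a subfolder named `tm`; the helper name `install_tmx`, the file name, and the project path are hypothetical, and the demo runs in a throwaway temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def install_tmx(tmx_path, project_dir):
    """Copy a .tmx file into <project_dir>/tm and return the new path."""
    tm_dir = Path(project_dir) / "tm"
    tm_dir.mkdir(parents=True, exist_ok=True)  # create tm/ if it is missing
    return Path(shutil.copy2(tmx_path, tm_dir))


# Demo with placeholder names in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "white_fang.tmx"
    source.write_text("<tmx version='1.4'><header/><body/></tmx>", encoding="utf-8")
    installed = install_tmx(source, Path(tmp) / "my_project")
    copied_ok = installed.exists() and installed.parent.name == "tm"
    print(installed.name, copied_ok)
```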
Other tools with OmegaT
- .tmx files can be cleaned with tmxcleaner
- .tmx files can be merged with tmxmerger
- .tmx files can be validated with tmxvalidator
(These tools can be downloaded from the OmegaT site.)
It is important at least to validate the files before adding them to OmegaT’s translation memory.
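To give a feel for what validation catches, here is a rough sanity check in Python. It is not a substitute for tmxvalidator and does not check the file against the TMX specification; `check_tmx` and the rules it applies are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def check_tmx(tmx_text):
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        root = ET.fromstring(tmx_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "tmx":
        problems.append(f"root element is <{root.tag}>, expected <tmx>")
    for n, tu in enumerate(root.iter("tu")):
        label = tu.get("tuid", f"#{n}")
        tuvs = tu.findall("tuv")
        if len(tuvs) < 2:
            problems.append(f"tu {label}: fewer than two tuv elements")
        for tuv in tuvs:
            seg = tuv.find("seg")
            if seg is None or not "".join(seg.itertext()).strip():
                problems.append(f"tu {label}: tuv without segment text")
            elif (seg.tail or "").strip():
                # Text between </seg> and </tuv> does not belong there
                problems.append(f"tu {label}: stray text after </seg>")
    return problems


good = ('<tmx><body><tu tuid="0"><tuv lang="en"><seg>Hello</seg></tuv>'
        '<tuv lang="es"><seg>Hola</seg></tuv></tu></body></tmx>')
print(check_tmx(good) or "no problems found")
```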
Current work: using these Open Source resources, a book is being translated from English to Spanish with the applied linguistics students at Colima University, with IPN backing. It should be ready by the middle of November.
Linguistica Computacional (Computational Linguistics)
Save your money. Use Open Source!