9. Manuel Herranz (Pangeanic): Hybrid Solutions for Translation
TRANSCRIPT
Alex Helle / Manuel Herranz
PangeaMT
Sharing Experiences on MT System,
Data management,
Hybridation
Intro / Brief history
Pangea system introduction / features for EXPERT
Hybridation experiences at Pangeanic (+future work)
Intro
Brief history
http://youtu.be/K-HfpsHPmvw
• “1-2 million words an hour”
• “quite adequate speed to cope with the whole output of the Soviet Union in a week… a few hours computer time a week”
• [full scale production] “if our experiments go well, within 5 years or so”
What is PangeaMT? The first commercial application of Open Source Moses (AMTA 2010, http://euromatrixplus.net/moses)
A development overcoming Moses limitations for the localization industry, presented at the Association for MT in the Americas: “PangeaMT putting open standards to work... well” (AMTA 2010) http://bit.ly/uM8x6V
06/2011 PangeaMT launches the DIY Solution to Machine Translate independently and flexibly like never before http://bit.ly/kSd3wC
07/2011 MT experiences Sony Europe http://slidesha.re/oxZmBS
07/2011 A harness that eases re-training and updating DIY SMT as presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU
02/2012 API for hosted solutions
What is PangeaMT?
2007/08
2009/10
2011/12
• DIY SMT
• Automated retraining
• API v1
• Glossary
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, docx, odt)
2007 and before
• RB tests with commercial software
• Insufficiently good output
• Only internal production
• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics (ES), then Fr/It/De in other fields
• Division born
• Hundreds of engine trials and language combinations
• Open-Source to commercial
• TMX / XLIFF workflows
2013
• Powerful API v2 for live translation
• Confidence scores
• Compatibility with more commercial formats
Unrest is continuing in Cairo as protesters set up their demand for Egypt’s
military rulers to resign
+ specific language rules
+ job or client glossary
+ hybrid technologies
SMT at work
Data? Best clean, thank you
Cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>
<tu creationdate="20090817T114430Z" creationid="APIACCESS"
changedate="20110617T141159Z" changeid=“pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25"; width –
<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1".</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{\f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–
<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1".<ept
i="3">}</ept></seg>
</tuv>
</tu>
<tuv xml:lang=“EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang=“DE-DE">
<seg>Am 22. </seg>
Data? Best clean, thank you
More cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang=“ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
<tuv xml:lang=“JP">
<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg><tuv xml:lang=“EN-US">
<seg>It is a journalistic point of view and strengths of the English-
language newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today'stranslation industry.
</seg>
Data? Best clean, thank you
Cleaning
Engine training with clean data
Approved, terminologically sound, clean data improves engine accuracy and performance, even with small data sets.
Data cleaning modules
• Remove any “suspects”:
• Sentences that are too long
• Mismatches (of many kinds!)
• Terminological inaccuracies
• Non-useful segments, etc
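The “suspects” filter above can be sketched roughly as follows. This is a minimal illustration, not PangeaMT's actual cleaning module: the name `is_suspect` and the thresholds `max_words` and `max_ratio` are assumptions, and it only covers the length and length-mismatch checks. It counts whitespace-separated words, so it would not suit unsegmented languages such as Japanese.

```python
def is_suspect(source: str, target: str,
               max_words: int = 80, max_ratio: float = 2.0) -> bool:
    """Flag a segment pair that should be reviewed before training.

    max_words and max_ratio are illustrative thresholds (assumptions).
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    # Empty side: nothing to align.
    if src_len == 0 or tgt_len == 0:
        return True
    # Sentences that are too long.
    if src_len > max_words or tgt_len > max_words:
        return True
    # Length mismatch: one side is far longer than the other.
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > max_ratio


# Pairs taken from the TMX samples shown earlier in the slides.
truncated = ("On 22nd May we decided not to join the group.", "Am 22. ")
padded = ("The President of the United States visited Costa Rica.",
          "El Presidente de los Estados Unidos, el señor Obama y su esposa "
          "la señora Michelle, visitaron Costa Rica el pasado sábado.")
# A made-up clean pair, for contrast only.
clean = ("The system recovers the methane.",
         "Le système récupère le méthane.")

for src, tgt in (truncated, padded, clean):
    print(is_suspect(src, tgt), "|", src)
```

Terminological inaccuracies and other mismatch types would need glossary or alignment information that this sketch does not model.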
Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools, document alignments, or crawling.
Data Cleaning (in-lines)
Remove all non-translation data.
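One way to picture in-line cleaning is the sketch below, which strips TMX in-line code elements from a segment with a regular expression. It is an assumption-laden illustration rather than the PangeaMT module: the helper name `strip_inlines` is hypothetical, and a production cleaner would more likely use a real XML parser. Only the elements that carry native formatting codes (`bpt`, `ept`, `ph`, `it`, `ut`) are removed, together with their content.

```python
import re

# TMX in-line elements whose content is native formatting code, not text.
INLINE_CODE = re.compile(r"<(bpt|ept|ph|it|ut)\b[^>]*>.*?</\1>", re.DOTALL)


def strip_inlines(segment: str) -> str:
    """Remove in-line code elements and normalise the leftover whitespace."""
    text = INLINE_CODE.sub("", segment)
    return re.sub(r"\s+", " ", text).strip()


# Segment adapted from the EN-US sample shown earlier in the slides.
raw = (r'Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25"; '
       r'width – <bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1".')
print(strip_inlines(raw))
```

Note that the typo “heigtht” survives: tag removal only handles markup, so spelling and terminology still need the other cleaning steps.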
TMX Human approval
Some of this material may actually be OK for training. It is then added to the training set.
System features – For EXPERT: Cleaning
System features – For EXPERT: Domain
System features – For EXPERT: Engine Creation
System features – For EXPERT: Engine Training
Unrest is continuing in Cairo as protesters set up their demand for Egypt’s
military rulers to resign
• specific language rules
• job / client glossary
• hybrid technologies
• good BLEU tracking, ideal for experimentation
System features – For EXPERT
Typically 5-grams, DL (distortion limit), table
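To make the n-gram setting concrete, the sketch below lists the 5-grams of the example sentence used in these slides. It only illustrates what an order-5 window looks like; it is not how Moses or its language-model toolkits count n-grams, and the helper name `ngrams` is an assumption.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ("Unrest is continuing in Cairo as protesters set up their "
            "demand for Egypt’s military rulers to resign")
tokens = sentence.split()
five_grams = ngrams(tokens, 5)

print(len(tokens), "tokens,", len(five_grams), "5-grams")
print(five_grams[0])
```

A sentence shorter than the window yields no n-grams at all, which is one reason small or fragmentary segments contribute little to a higher-order model.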
Different MT Systems for Different Lang Pairs?
Related languages
SMT, with accurate n-gram training and in-domain data (typically 5-grams, distortion limit, weights and fine-tuning)
Morphology-rich languages
Data is not enough and the casuistry is too large (Baltic languages like Latvian are extreme; Turkish is regular but has too many suffixes). SMT cannot cope: rule-based or hybrid.
Syntactically distant languages
Need additional information; this is where different HYBRID TECHNIQUES come into play. NO “ONE SIZE FITS ALL”
- When the syntactic distance between languages is very large (unrelated languages), patterns are lost (or not found) in monotone translation.
Hybridation Experiences at Pangeanic: Rationale
Output Translation
Data
Linguistic Information
Language Knowledge
SYNTAX-BASED HYBRID SMT
Altaic languages vs English
Arabic vs European languages
Agglutinative vs Non-agglutinative
Hybridation Experiences at Pangeanic: TWO OPTIONS
RE-ORDERING
Toshiba / Mecab benchmarking EN JP
CHALLENGES
SVO vs SOV
Tokenization: no spaces between words. Mecab/KyTea for JP, Peterson Segmentor for ZH
RBMT systems have traditionally worked with linguistic & morphological analyzers. Thus “units” were segmented.
SMT can’t, so we need to tokenize to leave a similar number of “words” on both sides. Giza++ can then relate words and groups.
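The effect of segmentation on token counts can be illustrated with a toy greedy longest-match segmenter. This is only a stand-in: it is not Mecab or KyTea, the function `segment` and the tiny `TOY_LEXICON` are hypothetical, and real analyzers rely on full dictionaries and statistical models.

```python
from typing import List, Set


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unknown characters become 1-char tokens."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-lexicon covering the Japanese example from these slides.
TOY_LEXICON = {"発売", "時", "に", "は", "同社", "次", "の", "バージョン",
               "を", "提供", "する", "予定", "です"}

japanese = "発売時には、同社は次のバージョンを提供する予定です。"
english = "When available, the company plans to offer the following:"

print(len(japanese.split()), "token before segmentation")
print(len(segment(japanese, TOY_LEXICON)), "tokens after segmentation")
print(len(english.split()), "English tokens")
```

Without segmentation the whole Japanese sentence is a single “word” for the aligner; after segmentation both sides have token counts of the same order, which is the precondition for word alignment mentioned above.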
Hybridation Experiences at Pangeanic: TWO OPTIONS
CHALLENGES
SVO vs SOV
Re-ordering?
Phrase-based or hierarchical models (syntactical)?
Hybridation Experiences at Pangeanic: TWO METHODS
Continue to press the button to scroll through the components of the program until
the display shows the desired current selection.
Japanese proper word order would be
the display the desired current selection shows until the components the program of
through to scroll the button to press continue.
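The kind of re-ordering shown in this example can be sketched as a recursive rewrite over a small hand-built tree. The node shapes (`"clause"`, `"head"`) and the function `to_head_final` are assumptions made for illustration; they are not the rule set used by Pangeanic, and a real system would derive the tree from a parser.

```python
from typing import List, Union

# A node is a plain string, ("clause", subject, verb, object)
# or ("head", head_word, complement).
Node = Union[str, tuple]


def to_head_final(node: Node) -> List[str]:
    """Rewrite a head-initial (English-like) tree into head-final word order."""
    if isinstance(node, str):
        return node.split()
    kind = node[0]
    if kind == "clause":
        _, subject, verb, obj = node
        # SVO becomes SOV: the verb moves to the end of its clause.
        return to_head_final(subject) + to_head_final(obj) + to_head_final(verb)
    if kind == "head":
        _, head, complement = node
        # Prepositions and conjunctions follow their complement.
        return to_head_final(complement) + to_head_final(head)
    raise ValueError(f"unknown node type: {kind!r}")


# "until the display shows the desired current selection"
tree = ("head", "until",
        ("clause", "the display", "shows", "the desired current selection"))
print(" ".join(to_head_final(tree)))
```

The printed fragment matches the first part of the re-ordered sentence above; the rest of the sentence would need further node types (for example for infinitives) that this sketch leaves out.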
SYNTAX-BASED (TREE) FOR HYBRID SMT
Hybridation Experiences at Pangeanic: Syntax-based analysis & re-ordering rules
Tree depth: 10
Calc time +59% !!
When available, the company plans to offer the following:
available When , the company the following : plans to offer :
発売時には、同社は次のバージョンを提供する予定です。
(VBPt3) (to) (VBinf) (DET) (NN)
(Predicate)
Nipponization module
Translation & Cleaning
(Subject) (VBPt) (to)
(ADV) (ADJ) (Punct) (DET) (NNSing)
(Cond clause),
SYNTAX-BASED RULES FOR HYBRID SMT
Hybridation Experiences at Pangeanic: Syntax-based analysis & re-ordering rules
TOSHIBA vs MECAB
Toshiba’s The Honyaku is an established RB system (+30 years)
Lacks flexibility, rules contradict each other
Proposal: re-arrange the whole EN corpus for JP with Toshiba’s rules, but this meant dependency on a proprietary system for future inputs.
Hybridation Experiences at Pangeanic: TWO OPTIONS
TOSHIBA vs MECAB – LESSONS LEARNT
Mecab re-ordering produced higher BLEU than Toshiba’s
5-fold structure
Paper published December 2011 (AAMT): “Going Hybrid: Pangeanic’s and Toshiba’s First Steps Toward ENJP MT Hybridation”
Hybridation Experiences at Pangeanic: TWO OPTIONS
Future (current) Work on Hybrids
Morphology-rich langs: RU in particular.
Improve DE
Distant languages: re-ordering for AR?
Agglutinative langs: TK – new paradigm
#manuelhrrnz #pangeanic pangeanic