tms days 04 2012 manuel herranz pangea mt
DESCRIPTION
Manuel Herranz presents at TMS Inspiration Days on Pangeanic's use case: the application of MT to LSPs and the Pangeanic development case. Unveiling the feature-rich PangeaMT SaaS Power, Pangeanic's v3.
TRANSCRIPT
PangeaMT
Manuel Herranz
PangeaMT : A Solution Built by the
Language Industry for the Language Industry
#manuelhrrnz #pangeanic E: [email protected]
First, a word of thanks to Andrzej from LindoLang and XTRF for their support in the Malima Project
Village of Gouria, Extreme North, Cameroon. Manuel has been working for 2 months on-site
• Computerizing a school
• Building a library
• Teacher training
• Designing water collection systems
• Improving agricultural systems
• Donating for local development
• Helping with internationalization
What is PangeaMT?
• The first commercial application of Open Source Moses (AMTA 2010, http://euromatrixplus.net/moses)
• A development overcoming Moses limitations for the localization industry, presented at the Association for MT in the Americas: "PangeaMT putting open standards to work... well", AMTA 2010 http://bit.ly/uM8x6V
• 06/2011 PangeaMT launches the DIY Solution to Machine Translate independently and flexibly like never before http://bit.ly/kSd3wC
• 07/2011 MT experiences Sony Europe http://slidesha.re/oxZmBS
• 07/2011 A harness that eases re-training and updating DIY SMT as presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU
• 02/2012 API for hosted solutions
What is PangeaMT?
2007/08
2009/10
2011/12
• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, itd)
2007 and before
• RB tests with commercial software
• Insufficiently good output
• Only internal production
• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics (ES), then Fr/It/De in other fields
• Division born
• 00's of engine trials and language combinations
• Open-Source to commercial
• TMX / XLIFF workflows
• Obtained data from customers and trained our own engines
• Translated tests with customized engines
• Sent test data to internal translators with detailed instructions (2 pages) for setting up a controlled experiment evaluation (EN to ES/FR/IT)
• Within 1 year, output per translator doubled. Carried out ROI calculation on 500,000 words.
Domain       Translation   MT+PE
Automotive   400 wph       900 wph
Marketing    250 wph       450 wph
Software     350 wph       1,000 wph
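As a rough worked example of the throughput figures above (a sketch only, not the ROI calculation that was actually carried out): assuming, purely for illustration, the same 500,000-word volume in each domain, the hours needed at each rate can be compared like this. The function names are made up for the example.

```python
# Words-per-hour figures taken from the throughput table above:
# (human translation, MT + post-editing).
RATES_WPH = {
    "Automotive": (400, 900),
    "Marketing": (250, 450),
    "Software": (350, 1000),
}

# Assumption for illustration: the same 500,000-word volume in every domain.
JOB_WORDS = 500_000


def hours_needed(words: int, wph: int) -> float:
    """Hours required to process `words` at `wph` words per hour."""
    return words / wph


def summarize(words: int = JOB_WORDS) -> dict:
    """Return hours for each workflow and the speed-up factor per domain."""
    summary = {}
    for domain, (human_wph, mtpe_wph) in RATES_WPH.items():
        summary[domain] = {
            "translation_hours": hours_needed(words, human_wph),
            "mtpe_hours": hours_needed(words, mtpe_wph),
            "speedup": mtpe_wph / human_wph,
        }
    return summary


if __name__ == "__main__":
    for domain, row in summarize().items():
        print(f"{domain}: {row['translation_hours']:.0f} h -> "
              f"{row['mtpe_hours']:.0f} h ({row['speedup']:.2f}x)")
```

Turning hours into money would need hourly costs and engine costs, which the slides do not give, so the sketch stops at hours and speed-up.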
What is PangeaMT?
1. Compatibility with TMX format to update intermediate translation memory (penalisation to user MT required) thanks to TMX parser
2. Possibility to translate only segments below a given match percentage (e.g. below 80% or 75%) or the full TMX
3. Compatibility with XLIFF files thanks to XLIFF parser
4. Compatibility with SDL Trados .ttx files thanks to TTX parser, even if PangeaMT fosters the deployment of non-proprietary, open-standard formats
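To make the parser idea concrete, here is a minimal sketch of reading translation units from a TMX file with Python's standard library only. The sample TMX content and the function name `read_tmx` are made up for illustration; this is not PangeaMT's parser, and XLIFF or TTX would need their own element names.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-GB" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="sketch"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="en-GB"><seg>Click the arrow below.</seg></tuv>
      <tuv xml:lang="es-ES"><seg>Haga clic en la flecha de abajo.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(tmx_text: str) -> list:
    """Return one {language code: segment text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside in-line tags.
                # Language codes are lower-cased because real TMX files are
                # inconsistent about case (en-GB vs EN-GB).
                unit[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext())
        units.append(unit)
    return units


if __name__ == "__main__":
    for unit in read_tmx(SAMPLE_TMX):
        print(unit)
```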
What is PangeaMT?
5. Google-like internal translation panel: a password-protected, personalised, web-based user interface, also called the MT request panel. It is highly intuitive and lets users query the PangeaMT system for MT. MT files are delivered to their mailboxes or can be downloaded from the panel
6. Inliner: Code / tag parser for handling format in-lines (italics, bold, hyperlinks, etc)
7. Fully generated Language Model (LM)
8. Glossary override capabilities. Also to handle DNTs
9. API
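One common way to get glossary override and do-not-translate (DNT) behaviour is placeholder masking: protected terms are swapped for tokens before MT and restored afterwards. The sketch below assumes that technique; the slides do not say how PangeaMT implements it. The function names are hypothetical and `engine` is a stand-in for any MT call.

```python
import re
from typing import Callable


def protect_terms(source: str, glossary: dict) -> tuple:
    """Swap glossary/DNT source terms for placeholders the engine should leave alone."""
    restore = {}
    masked = source
    # Longest terms first so "E-Value code" wins over a shorter "E-Value".
    for i, term in enumerate(sorted(glossary, key=len, reverse=True)):
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
        placeholder = f"__TERM{i}__"
        masked, count = pattern.subn(placeholder, masked)
        if count:
            restore[placeholder] = glossary[term]
    return masked, restore


def translate_with_glossary(source: str, glossary: dict,
                            engine: Callable[[str], str]) -> str:
    """Mask protected terms, call the engine, then put the mandated terms back."""
    masked, restore = protect_terms(source, glossary)
    output = engine(masked)
    for placeholder, target_term in restore.items():
        output = output.replace(placeholder, target_term)
    return output


if __name__ == "__main__":
    # A DNT entry maps a term to itself; an override maps it to the required target.
    glossary = {"E-Value code": "E-Value code"}
    # str.upper is only a stand-in "engine" that visibly changes everything else.
    print(translate_with_glossary("Open the E-Value code panel", glossary, str.upper))
```

A real engine may split or reorder placeholders, so a production setup would need to verify that every placeholder survived before restoring.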
Summing Up
PangeaMT is not an existing, out-of-the-box software package. We believe in applying pre-existing know-how about MT, techniques, development experience, etc. in order to develop your engines. These engines will not be wrapped in a software version for which you would have to keep paying every year, i.e. forever. In fact, each deployment and application is particular to each client.
First-time delivered PangeaMT engines need re-training for some time but they can be deployed for production extremely quickly.
Application to Translation Companies
Traditional Translation Workflow: TM -> Find exact matches or % -> Human Translation -> Post-editing / Revision
Next Generation MT-based: TM -> Find exact matches or % -> PangeaMT Translation -> Post-editing / Revision
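The MT-based workflow can be sketched as a routing decision per segment: reuse the TM when a match is good enough, otherwise send the segment to MT. This is a minimal sketch under stated assumptions: `difflib.SequenceMatcher` stands in for a CAT tool's fuzzy-match score (real tools score matches differently), the 80% default is just one of the thresholds mentioned earlier, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def best_tm_match(segment: str, tm: dict) -> tuple:
    """Return (score, stored translation) for the closest TM source segment."""
    best_score, best_target = 0.0, None
    for tm_source, tm_target in tm.items():
        score = SequenceMatcher(None, segment, tm_source).ratio()
        if score > best_score:
            best_score, best_target = score, tm_target
    return best_score, best_target


def route(segment: str, tm: dict, threshold: float = 0.80) -> tuple:
    """Decide whether a segment is served from the TM or sent to MT.

    Returns ("TM", translation) or ("MT", None). Either way the result
    still goes to post-editing / revision afterwards.
    """
    score, target = best_tm_match(segment, tm)
    if target is not None and score >= threshold:
        return ("TM", target)
    return ("MT", None)


if __name__ == "__main__":
    # Made-up one-entry TM for illustration.
    tm = {"Click the arrow below.": "Haga clic en la flecha de abajo."}
    print(route("Click the arrow below.", tm))
    print(route("Unrest is continuing in Cairo.", tm))
```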
How does PangeaMT work?
Unrest is continuing in Cairo as protesters set up their demand for Egypt’s military rulers to resign
+ specific language rules
+ job or client glossary
+ hybrid technologies
What happens when languages have very different structures?
SYNTAX-BASED HYBRID SMT
Altaic languages / English
Arabic / European languages
Agglutinative / Non-agglutinative
Diagram labels: Data, Linguistic Information, Language Knowledge, Output Translation
If you are searching for a system with an E-Value code , click the arrow below.
Tree= “If” <CC>, VBS <tree>“you”<NNP>“are”<VBM>“searching”<VB>…
you for a system with an E-Value code are searching if , the arrow click below.
E−Value コードを 指定して システムを 検索する 場合は、 下の 矢印を クリックして ください
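The reordering step in this example can be illustrated with a toy sketch. It assumes the clause has already been split into labelled chunks by hand; the labels, the rank table and the function name are illustrative assumptions, not PangeaMT's actual rules or parser output. It only shows the idea of moving English chunks into a head-final, Japanese-like order before translation.

```python
# Target position of each chunk type in a head-final (Japanese-like) clause.
# These labels and ranks are illustrative assumptions, not PangeaMT's rules.
HEAD_FINAL_RANK = {"SUBJ": 0, "OBJ": 1, "VERB": 2, "ADV": 3, "CONJ": 3}


def reorder_clause(chunks: list) -> str:
    """Reorder (text, label) chunks; sorted() is stable, so ties keep source order."""
    ordered = sorted(chunks, key=lambda chunk: HEAD_FINAL_RANK[chunk[1]])
    return " ".join(text for text, _ in ordered)


# Hand-labelled chunks for the example sentence above.
conditional = [
    ("if", "CONJ"),
    ("you", "SUBJ"),
    ("are searching", "VERB"),
    ("for a system with an E-Value code", "OBJ"),
]
main_clause = [("click", "VERB"), ("the arrow", "OBJ"), ("below", "ADV")]

if __name__ == "__main__":
    print(reorder_clause(conditional))
    print(reorder_clause(main_clause))
```

With these hand-made labels the two clauses come out in the same chunk order as the reordered English line above; a real system would derive the chunks from a syntactic parse rather than from manual labels.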
How does PangeaMT work?
SYNTAX-BASED HYBRID SMT
When available, the company plans to offer the following:
available When , the company the following : plans to offer :
発売 時 には、 同社は 次の バージョンを 提供する 予定 です 。
Diagram labels: (Cond clause) (ADV) (ADJ) (Punct), (Subject) (DET) (NNSing), (Predicate) (VBPt3) (to) (VBinf) (DET) (NN), (VBPt) (to)
Processing steps: Nipponization module, Translation & Cleaning
Don’t forget data cleaning!!!
<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg></tuv><tuv xml:lang="FR-FR"><seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg></tuv>
<tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid=“pat"><tuv xml:lang="EN-US"><seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25"; width –<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1".</seg></tuv><tuv xml:lang="ES-EM"><seg><bpt i="1">{\f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1".<ept i="3">}</ept></seg></tuv></tu>
<tuv xml:lang=“EN-US"><seg>On 22nd May we decided not to join the group.</seg><tuv xml:lang=“DE-DE"><seg>Am 22. </seg>
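The second sample above carries RTF formatting codes inside TMX in-line elements (`<bpt>`, `<ept>`). A minimal sketch of stripping such codes from a segment with a regular expression follows. It is an assumption about one way to do it, not PangeaMT's Inliner; the function name is hypothetical, and the pattern does not handle nested `<sub>` elements.

```python
import re

# TMX in-line elements whose *content* is formatting code, not translatable text.
# <hi> is deliberately left out because it wraps translatable text.
INLINE_CODE = re.compile(r"<(bpt|ept|ph|it|ut)\b[^>]*>.*?</\1>", re.DOTALL)
MULTISPACE = re.compile(r"\s{2,}")


def strip_inlines(seg: str) -> str:
    """Drop in-line code elements and collapse the whitespace they leave behind."""
    without_codes = INLINE_CODE.sub("", seg)
    return MULTISPACE.sub(" ", without_codes).strip()


if __name__ == "__main__":
    # Fragment of the English segment from the sample above.
    sample = r'width –<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1".'
    print(strip_inlines(sample))
```

For training data the codes are usually just noise; for production translation they have to be put back in the right place, which is a harder problem than this sketch covers.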
More cleaning
Cleaning
Don’t forget data cleaning!!!
<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>The President of the United States visited Costa Rica.</seg></tuv><tuv xml:lang=“ES-ES"><seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg></tuv>
<tuv xml:lang=“JP"><seg> 同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。 英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。 </seg><tuv xml:lang=“EN-US"><seg>It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today'stranslation industry. </seg>
Engine training with clean data
Having approved, terminologically sound, clean data improves engine accuracy and performance, even with small sets of data.
Data cleaning modules
• Remove any “suspects”:
• Sentences that are too long
• Mismatches (of many kinds!)
• Terminological inaccuracies
• Non-useful segments, etc
Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools, document alignments or crawling.
Data Cleaning (in-lines)
Remove all non-translation data.
TMX Human approval
Some of this material may actually be OK for training. It is then input in the training set.
DATA CLEANING CYCLE (AUTOMATED)
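Two of the "suspect" checks in the cleaning cycle (over-long sentences and length mismatches) can be sketched as a simple filter that splits a corpus into clean pairs and pairs set aside for human approval. The thresholds and function names are assumptions chosen for illustration, and word counts based on whitespace will not work for languages such as Japanese that are written without spaces.

```python
def is_suspect(source: str, target: str,
               max_words: int = 80, max_ratio: float = 2.0) -> bool:
    """Flag a segment pair that should not go straight into training."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return True  # one side is empty: non-useful segment
    if src_len > max_words or tgt_len > max_words:
        return True  # sentence too long
    # Length mismatch: one side has far more words than the other.
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > max_ratio


def split_corpus(pairs: list) -> tuple:
    """Return (clean pairs, pairs held back for human approval)."""
    clean, review = [], []
    for pair in pairs:
        (review if is_suspect(*pair) else clean).append(pair)
    return clean, review


if __name__ == "__main__":
    pairs = [
        # Truncated target, as in the German sample shown earlier.
        ("On 22nd May we decided not to join the group.", "Am 22."),
        # Made-up well-aligned pair for contrast.
        ("Click the arrow below.", "Haga clic en la flecha de abajo."),
    ]
    clean, review = split_corpus(pairs)
    print(len(clean), "clean,", len(review), "for review")
```

Terminological inaccuracies need a glossary or a human reviewer and are outside what a length check can catch.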
How does PangeaMT work?
On-line Automated Engine Training
How does PangeaMT work?
Translation request (+API, webservice, etc)
How does PangeaMT work?
• DIY SMT can be self-hosted when data security is critical (all processes internal to the organization)
  - commercially sensitive data,
  - financial, legal, institutional,
  - intelligence, knowledge-gathering,
  - product pre-release, etc
• Control Panel with full system statistics
• First training and cleaning at Pangeanic, then re-trainings and updates by the client (+support)
• SaaS model with updates for organizations which prefer outsourcing or cloud training.
Advantages of PangeaMT
Myth: MT will never be as good as humans
“We cannot solve the problem using the same tools and the way of thinking that created it” A. Einstein
uhmmm, it is going to get really good...
1st stage: We are creating usable engines, first PE experiences 2009-2015 or 2020
2nd stage: PE material and more data make engines even more predictable. More specialist engines
3rd stage: Beyond 2030... no predictions