tms days 04 2012 manuel herranz pangea mt


Upload: manuel-herranz

Post on 29-Jun-2015


Category:

Business



DESCRIPTION

Manuel Herranz presents at TMS Inspiration Days on Pangeanic's use case, the application of MT to LSPs and the Pangeanic development case, unveiling the feature-rich PangeaMT SaaS Power, Pangeanic's v3.

TRANSCRIPT

Page 1: Tms days 04 2012 manuel herranz pangea mt

PangeaMT

Manuel Herranz

PangeaMT: A Solution Built by the Language Industry for the Language Industry

#manuelhrrnz #pangeanic E: [email protected]

Page 2: Tms days 04 2012 manuel herranz pangea mt

First, a word of thanks to Andrzej from LindoLang and XTRF for their support in the Malima Project.

Village of Gouria, Extreme North, Cameroon. Manuel has been working for 2 months on-site:
• Computerizing a school
• Building a library
• Teacher training
• Designing water collection systems
• Improving agricultural systems
• Donating for local development
• Helping with internationalization

Page 3: Tms days 04 2012 manuel herranz pangea mt

What is PangeaMT?

• The first commercial application of Open Source Moses (AMTA 2010, http://euromatrixplus.net/moses)
• A development overcoming Moses limitations for the localization industry, presented at the Association for MT in the Americas: PangeaMT putting open standards to work... well, AMTA 2010 http://bit.ly/uM8x6V
• 06/2011 PangeaMT launches the DIY Solution to Machine Translate independently and flexibly like never before http://bit.ly/kSd3wC
• 07/2011 MT experiences at Sony Europe http://slidesha.re/oxZmBS
• 07/2011 A harness that eases re-training and updating DIY SMT, as presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU
• 02/2012 API for hosted solutions

Page 4: Tms days 04 2012 manuel herranz pangea mt

What is PangeaMT?

2007 and before
• RB tests with commercial software
• Insufficiently good output
• Only internal production
• EU Post-Editing Award

2007/08
• V1: Small data sets (2-5M words), automotive & electronics (ES), then Fr/It/De in other fields

2009/10
• Division born
• 00's of engine trials and language combinations
• Open-Source to commercial
• TMX / XLIFF workflows

2011/12
• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, itd)

Page 5: Tms days 04 2012 manuel herranz pangea mt

What is PangeaMT?

• Obtained data from customers and trained our own engines
• Translated tests with customized engines
• Sent test data to internal translators with detailed instructions (2 pages) for setting up a controlled experiment evaluation (EN to ES/FR/IT)
• Within 1 year, output per translator doubled. Carried out ROI calculation on 500,000 words.

Domain        Translation    MT+PE
Automotive    400 wph        900 wph
Marketing     250 wph        450 wph
Software      350 wph        1,000 wph
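The words-per-hour figures on this slide translate directly into hours of work. The sketch below is only that arithmetic, applied to a 500,000-word volume. It assumes, hypothetically, that the whole volume falls in a single domain; the slide does not give the actual ROI formula or any cost rates, so only hours and speed-up are computed.

```python
# Words-per-hour figures from the slide: (human translation, MT + post-editing).
THROUGHPUT_WPH = {
    "Automotive": (400, 900),
    "Marketing": (250, 450),
    "Software": (350, 1000),
}


def hours_needed(words: int, wph: int) -> float:
    """Hours required to process `words` at `wph` words per hour."""
    return words / wph


def compare(words: int = 500_000) -> dict:
    """Per-domain hours for both workflows, hours saved and speed-up factor."""
    results = {}
    for domain, (human_wph, mtpe_wph) in THROUGHPUT_WPH.items():
        human_h = hours_needed(words, human_wph)
        mtpe_h = hours_needed(words, mtpe_wph)
        results[domain] = {
            "human_hours": human_h,
            "mtpe_hours": mtpe_h,
            "hours_saved": human_h - mtpe_h,
            "speedup": mtpe_wph / human_wph,
        }
    return results


if __name__ == "__main__":
    for domain, r in compare().items():
        print(
            f"{domain}: {r['human_hours']:.0f} h vs {r['mtpe_hours']:.0f} h "
            f"(x{r['speedup']:.2f})"
        )
```

Multiplying `hours_saved` by an hourly rate would give a cost figure, but any such rate would be an assumption of the reader, not a number from the talk.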

Page 6: Tms days 04 2012 manuel herranz pangea mt

What is PangeaMT?

1. Compatibility with the TMX format, thanks to a TMX parser, to update an intermediate translation memory (the user needs to apply a penalty to MT segments)

2. Option to translate only segments below a given match percentage (e.g. below 80% or 75%) or the full TMX

3. Compatibility with XLIFF files thanks to an XLIFF parser

4. Compatibility with SDL Trados .ttx files thanks to a TTX parser, even though PangeaMT fosters the deployment of non-proprietary, open-standard formats
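The match-percentage option in item 2 can be pictured as a routing step: segments whose best translation-memory match falls below the threshold go to MT, and the rest reuse the TM match. The sketch below is an illustration only. It uses `difflib.SequenceMatcher` as a stand-in for a CAT tool's fuzzy-match score, which is an assumption and not how PangeaMT or any particular CAT tool scores matches; `best_match` and `route` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def best_match(segment: str, tm: Dict[str, str]) -> Tuple[float, Optional[str]]:
    """Return (score in percent, target text) of the closest TM source segment."""
    best_score, best_target = 0.0, None
    for source, target in tm.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio() * 100
        if score > best_score:
            best_score, best_target = score, target
    return best_score, best_target


def route(segments: List[str], tm: Dict[str, str], threshold: float = 75.0):
    """Split segments into those sent to MT and those served from the TM."""
    to_mt, from_tm = [], []
    for segment in segments:
        score, target = best_match(segment, tm)
        if score < threshold:
            to_mt.append(segment)  # below the threshold: machine-translate
        else:
            from_tm.append((segment, target, round(score)))
    return to_mt, from_tm


if __name__ == "__main__":
    tm = {"Press the start button.": "Pulse el botón de inicio."}
    segments = [
        "Press the start button.",
        "Remove the battery cover before cleaning.",
    ]
    print(route(segments, tm, threshold=75.0))
```

Setting the threshold above 100 sends every segment to MT, which corresponds to the "full TMX" case.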

Page 7: Tms days 04 2012 manuel herranz pangea mt

What is PangeaMT?

5. Google-like internal translation panel: a password-protected, personalised, web-based user interface, also called the MT request panel. It is highly intuitive and lets users query the PangeaMT system for MT. MT files are delivered to their mailboxes or can be downloaded from the panel

6. Inliner: code / tag parser for handling format in-lines (italics, bold, hyperlinks, etc.)

7. Fully generated Language Model (LM)

8. Glossary override capabilities, also used to handle DNTs

9. API
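One common way to handle format in-lines like those in item 6 is to swap each tag for a placeholder before translation and put the tags back afterwards. The sketch below shows that general technique under stated assumptions; it is not PangeaMT's Inliner, and the placeholder format `__TAGn__` and both function names are hypothetical.

```python
import re
from typing import List, Tuple

# Matches simple inline markup such as <b>, </b> or <a href="...">.
TAG_RE = re.compile(r"<[^<>]+>")
PLACEHOLDER_RE = re.compile(r"__TAG(\d+)__")


def shield_tags(segment: str) -> Tuple[str, List[str]]:
    """Replace each inline tag with a numbered placeholder; return text and tags."""
    tags: List[str] = []

    def _stash(match: "re.Match") -> str:
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"

    return TAG_RE.sub(_stash, segment), tags


def restore_tags(segment: str, tags: List[str]) -> str:
    """Put the original tags back in place of their placeholders."""
    return PLACEHOLDER_RE.sub(lambda m: tags[int(m.group(1))], segment)


if __name__ == "__main__":
    shielded, tags = shield_tags("Click <b>Save</b> to continue.")
    print(shielded)  # placeholders travel through the MT engine
    # A translated segment may reorder the placeholders; restoring still works.
    print(restore_tags("Para continuar, haga clic en __TAG0__Guardar__TAG1__.", tags))
```

The same placeholder idea could shield do-not-translate terms, though whether the glossary override in item 8 works this way is not stated in the slides.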

Page 8: Tms days 04 2012 manuel herranz pangea mt

Summing Up

PangeaMT is not an existing, out-of-the-box software product. We believe in applying pre-existing know-how about MT, techniques, development experience, etc. to develop your engines: engines that will not be wrapped in a software version for which you would have to keep paying every year, i.e. forever. In fact, each deployment and application is particular to each client.

First-time delivered PangeaMT engines need re-training for some time, but they can be deployed for production extremely quickly.

Page 9: Tms days 04 2012 manuel herranz pangea mt

Application to Translation Companies

Traditional Translation Workflow
1. TM
2. Find exact matches or %
3. Human Translation
4. Post-editing / Revision

Next Generation MT-based
1. TM
2. Find exact matches or %
3. PANGEAMT Translation
4. Post-editing / Revision

Page 10: Tms days 04 2012 manuel herranz pangea mt

How does PangeaMT work?

Unrest is continuing in Cairo as protesters step up their demand for Egypt’s military rulers to resign

+ specific language rules
+ job or client glossary
+ hybrid technologies

Page 11: Tms days 04 2012 manuel herranz pangea mt

What happens when languages have very different structures?

SYNTAX-BASED HYBRID SMT

Altaic languages / English
Arabic / European languages
Agglutinative / Non-agglutinative

[Diagram labels: Data, Linguistic Information, Language Knowledge, Output Translation]

Page 12: Tms days 04 2012 manuel herranz pangea mt

How does PangeaMT work? SYNTAX-BASED HYBRID SMT

If you are searching for a system with an E-Value code, click the arrow below.

Tree= “If” <CC>, VBS <tree>“you”<NNP>“are”<VBM>“searching”<VB>…

you for a system with an E-Value code are searching if , the arrow click below.

E−Value コードを 指定して システムを 検索する 場合は、 下の 矢印を クリックして ください
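The reordering step on this slide can be illustrated with a toy head-final transform over a small hand-built parse tree: verbs and prepositions are moved to the end of their phrase, which brings English word order closer to Japanese before translation. This is a sketch of the general pre-reordering idea only. The rules, the `Node` class and the labels are hypothetical, and it does not reproduce PangeaMT's rules or the exact reordered string shown above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A parse-tree node: a label plus child nodes or plain words."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


# Hypothetical rule table: phrase label -> label of the head child to move last.
HEAD_FINAL = {"VP": "VB", "PP": "IN"}


def reorder(node: Union[Node, str]) -> List[str]:
    """Return the words of the tree with heads moved to the end of their phrase."""
    if isinstance(node, str):
        return [node]
    children = list(node.children)
    head_label = HEAD_FINAL.get(node.label)
    if head_label is not None:
        heads = [c for c in children if isinstance(c, Node) and c.label == head_label]
        others = [c for c in children if not (isinstance(c, Node) and c.label == head_label)]
        children = others + heads
    words: List[str] = []
    for child in children:
        words.extend(reorder(child))
    return words


if __name__ == "__main__":
    # "searching for a system"
    tree = Node("VP", [
        Node("VB", ["searching"]),
        Node("PP", [Node("IN", ["for"]), Node("NP", ["a", "system"])]),
    ])
    print(" ".join(reorder(tree)))
```

In the toy output the object comes first and the verb last, mirroring the order of システムを 検索する in the Japanese sentence.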

Page 13: Tms days 04 2012 manuel herranz pangea mt

How does PangeaMT work? SYNTAX-BASED HYBRID SMT

When available, the company plans to offer the following:

available When , the company the following : plans to offer :

発売 時 には、 同社は 次の バージョンを 提供する 予定 です 。

[Diagram labels: (Cond clause), (Subject), (Predicate); POS tags: (ADV) (ADJ) (Punct) (DET) (NNSing), (VBPt) (to), (VBPt3) (to) (VBinf) (DET) (NN); modules: Nipponization module, Translation & Cleaning]

Page 14: Tms days 04 2012 manuel herranz pangea mt

Don’t forget data cleaning!!!

<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg></tuv><tuv xml:lang="FR-FR"><seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg></tuv>

<tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid=“pat"><tuv xml:lang="EN-US"><seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25&quot;; width –<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg></tuv><tuv xml:lang="ES-EM"><seg><bpt i="1">{\f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1&quot;.<ept i="3">}</ept></seg></tuv></tu>

<tuv xml:lang=“EN-US"><seg>On 22nd May we decided not to join the group.</seg><tuv xml:lang=“DE-DE"><seg>Am 22. </seg>

More cleaning

Cleaning

Page 15: Tms days 04 2012 manuel herranz pangea mt

Don’t forget data cleaning!!!

<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>The President of the United States visited Costa Rica.</seg></tuv><tuv xml:lang=“ES-ES"><seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg></tuv>

<tuv xml:lang=“JP"><seg> 同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。 英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。 </seg><tuv xml:lang=“EN-US"><seg>It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today'stranslation industry. </seg>
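The translation units on these two slides show typical problems: targets far longer or shorter than their source, encoding debris inside words, and leftover inline formatting codes. A minimal sketch of segment-level checks for those three cases follows. The thresholds, regular expressions and the `suspect_reasons` name are assumptions chosen for illustration, not PangeaMT's cleaning modules.

```python
import re
from typing import List

# TMX inline elements (<bpt>, <ept>, ...) or RTF font codes such as {\f43
INLINE_CODE_RE = re.compile(r"</?(?:bpt|ept|ph|it|ut)\b|\{\\f\d+")
# A letter immediately followed by a euro sign or the Unicode replacement character
DEBRIS_RE = re.compile(r"[^\W\d_][€\ufffd]")


def suspect_reasons(source: str, target: str,
                    min_ratio: float = 0.5, max_ratio: float = 2.0) -> List[str]:
    """Return the reasons a segment pair looks suspect (empty list if none)."""
    source, target = source.strip(), target.strip()
    if not source or not target:
        return ["empty"]
    reasons = []
    ratio = len(target) / len(source)
    if ratio < min_ratio or ratio > max_ratio:
        reasons.append("length_ratio")  # truncated or padded translation
    if DEBRIS_RE.search(source) or DEBRIS_RE.search(target):
        reasons.append("encoding_debris")
    if INLINE_CODE_RE.search(source) or INLINE_CODE_RE.search(target):
        reasons.append("inline_codes")
    return reasons


if __name__ == "__main__":
    print(suspect_reasons(
        "On 22nd May we decided not to join the group.",
        "Am 22. ",
    ))
```

A character-length ratio is a crude test and would need different bounds for language pairs such as Japanese and English, where character counts differ widely.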


Page 16: Tms days 04 2012 manuel herranz pangea mt

DATA CLEANING CYCLE (AUTOMATED)

[Cycle labels: Cleaning, More cleaning]

Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools, document alignments or crawling.

Data Cleaning (in-lines)
Remove all non-translation data.

Data cleaning modules
• Remove any “suspects”:
  • Sentences that are too long
  • Mismatches (of many kinds!)
  • Terminological inaccuracies
  • Non-useful segments, etc.

TMX Human approval
Some of this material may actually be OK for training. It is then input in the training set.

Engine training with clean data
Having approved, terminologically sound, clean data improves engine accuracy and performance with even small sets of data.
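The cycle on this slide can be read as a routing loop: clean pairs go straight to the training set, suspects go to a human, and approved suspects rejoin the training set. The sketch below shows that flow under stated assumptions. The function names, the `too_long` rule and its 80-word limit are hypothetical, and the human step is represented by a callback.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def too_long(pair: Pair, max_words: int = 80) -> bool:
    """Example 'suspect' rule: either side exceeds max_words words."""
    return max(len(pair[0].split()), len(pair[1].split())) > max_words


def cleaning_cycle(
    pairs: Iterable[Pair],
    is_suspect: Callable[[Pair], bool],
    human_approves: Callable[[Pair], bool],
) -> Tuple[List[Pair], List[Pair]]:
    """Return (training set, rejected pairs) after automatic and human checks."""
    training: List[Pair] = []
    rejected: List[Pair] = []
    for pair in pairs:
        if not is_suspect(pair):
            training.append(pair)   # clean data goes straight to training
        elif human_approves(pair):
            training.append(pair)   # suspect, but judged OK for training
        else:
            rejected.append(pair)
    return training, rejected


if __name__ == "__main__":
    data = [
        ("Press the button.", "Pulse el botón."),
        ("word " * 100, "palabra " * 100),
    ]
    training, rejected = cleaning_cycle(data, too_long, human_approves=lambda p: False)
    print(len(training), len(rejected))
```

Passing the checks in as callbacks keeps the loop independent of which cleaning modules are in use, so further rules can be added without changing the cycle itself.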

Page 17: Tms days 04 2012 manuel herranz pangea mt

How does PangeaMT work?

On-line Automated Engine Training

Page 18: Tms days 04 2012 manuel herranz pangea mt

How does PangeaMT work?

Translation request (+API, webservice, etc)

Page 19: Tms days 04 2012 manuel herranz pangea mt

How does PangeaMT work?

Page 20: Tms days 04 2012 manuel herranz pangea mt

Advantages of PangeaMT

• DIY SMT can be self-hosted when data security is critical (all processes internal to the organization):
  - commercially sensitive data
  - financial, legal, institutional
  - intelligence, knowledge-gathering
  - product pre-release, etc.
• Control Panel with full system statistics
• First training and cleaning at Pangeanic, then re-trainings and updates by the client (+support)
• SaaS model with updates for organizations which prefer outsourcing or cloud training.

Page 21: Tms days 04 2012 manuel herranz pangea mt

Myth: MT will never be as good as humans

“We cannot solve the problem using the same tools and the way of thinking that created it” A. Einstein

uhmmm, it is going to get really good...

1st stage: We are creating usable engines, first PE experiences. 2009-2015 or 2020

2nd stage: PE material and more data make engines even more predictable. More specialist engines

3rd stage: Beyond 2030... no predictions

Page 22: Tms days 04 2012 manuel herranz pangea mt

PangeaMT

Manuel Herranz

Thanks!! Questions?

#manuelhrrnz #pangeanic E: [email protected]