taus open source machine translation showcase, paris, manuel herranz, pangeanic, 4 june 2012


DESCRIPTION

This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates, follow us on Twitter - #MosesCore

TRANSCRIPT

Page 1: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Paris, Manuel Herranz, Pangeanic, 4 June 2012

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE

PangeaMT: Putting Open Standards to work, 16:00-16:15, Monday 4 June

Manuel Herranz, Pangeanic

Page 2:

PangeaMT – putting open standards to work… well

Manuel Herranz #manuelhrrnz #pangeanic E: [email protected]

Page 3:

MACHINE TRANSLATION

Make my day,

IS NOT

IS

become

a post-editor

Page 4:

Chomsky: Imagine that if, in the future, large enough amounts of data existed, they could be processed by computers with enough computing power

rule-based systems, IBM licenses, many linked to patent EN/RU & Intel

First statistical papers

1st Open source SMT

Translation industry appropriating Moses: http://euromatrixplus.net/moses

DIY SMT

http://t.co/HDTboxQ

Page 5:

PEAK of Cold War and information control. Products & information directed to consumers / users / citizens

BEGINNING of data resources. Internet. Accessibility to information

Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world

Page 6:

Types of LSPs (Ben Sargent – TMS Inspiration Days April ‘11 – Krakow)

a) developers of a system (they develop it for their own use and for their clients),

b) buyers of systems (they do not want the headache of starting from scratch and prefer to buy ready-built solutions) and finally

c) those who prefer the mix-and-match approach (buying some good solutions outside and building interfaces and what they know works best for their business). The trend is towards unification

Page 7:

2007/08


2009/10

2011/12

• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users

• Compatibility with commercial formats (ttx, sdlxliff, itd)

2007 and before

• RB tests with commercial software
• Insufficiently good output
• Only internal production

• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics (ES), then FR/IT/DE in other fields

• Division born
• 00's of engine trials and language combinations

• Open-Source to commercial

• TMX / XLIFF workflows

As of May 2009: 487 billion gigabytes, or 1,000,000,000 * 487,000,000,000 = 4.87 x 10^20 bytes

Estimates:
• Up 50% a year (Oracle)
• Doubles every 11 hours (IBM)
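The byte count above can be checked with a few lines of Python. This is a minimal sketch that assumes decimal units (1 gigabyte = 10^9 bytes), as the slide's own multiplication does; the variable names are illustrative.

```python
# Assumption: decimal gigabytes (1 GB = 10**9 bytes), matching the slide's arithmetic.
BYTES_PER_GIGABYTE = 10**9

# 487 billion gigabytes: the May 2009 figure quoted on the slide.
gigabytes = 487 * 10**9

total_bytes = BYTES_PER_GIGABYTE * gigabytes
print(f"{total_bytes:.2e} bytes")  # prints 4.87e+20 bytes
```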

Page 8:

OBJECTIVES = CHALLENGES 2007 - 2010

Turn academic development (Moses) into a commercial application.

To provide high-quality MT for post-editing, saving time and cost. No Google-type broad translation, but domain-specific and user-centric.

Lower the entry level for MT. Bring affordability and user control / empowerment to MT. Bring it to the user; take it away from the programmer.

How? By fostering translation automation strategies geared to open standards.

To use only community-based open standards (OASIS / ISO: XLIFF, TMX, XML). NO proprietary formats (technology independence), so USERS are not “locked in” to buying and updating expensive software.
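One practical consequence of staying with open standards is that the data can be read with general-purpose tools. As a rough illustration (not part of PangeaMT), the Python sketch below reads a tiny TMX document with the standard library's `xml.etree.ElementTree`. The sample document and the function name `read_translation_units` are invented for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample data, for illustration only.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en-GB" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-GB"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Bonjour le monde.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            unit[tuv.get(XML_LANG)] = tuv.findtext("seg", default="")
        units.append(unit)
    return units


units = read_translation_units(TMX_SAMPLE)
print(units)
```

Segments containing inline markup would need extra handling; this sketch only covers plain-text segments.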

DIY SMT June 2011 http://t.co/HDTboxQ

Page 9:

The rush for data

We soon realised that there was a rush to gather data, but that other resources around the data were also necessary

cleaning

More cleaning

Page 10:

cleaning

More cleaning

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>A system for recovering the methane that is emitted from the manure so that

it does not leak into the atmosphere.</seg>

</tuv>

<tuv xml:lang="FR-FR">

<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel

d'origine animale de sorte qu'il ne se dissipe pas dans l'atm sphère.</seg>

</tuv>

<tuv xml:lang=“EN-US">

<seg>On 22nd May we decided not to join the group.</seg>

<tuv xml:lang=“DE-DE">

<seg>Am 22. </seg>

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>The President of the United States visited Costa Rica.</seg>

</tuv>

<tuv xml:lang=“ES-ES">

<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora

Michelle, visitaron Costa Rica el pasado sábado.</seg>

</tuv>

Page 11:

cleaning

More cleaning

<tuv xml:lang=“JP">

<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。

英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>

<tuv xml:lang=“EN-US">

<seg>It is a journalistic point of view and strengths of the English-

language newspaper Japan Times. It includes a description of the exciting and

rewarding work of translation and interpretation, as well as the introduction of

consciousness and how to acquire the required professional skills. The road to

becoming a translator and interpreter also down to the actual work site, a

comprehensive guide to interpreting the reality of today'stranslation industry.

</seg>
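The kinds of problems shown in the cleaning examples (corrupted characters, truncated targets, targets that add content missing from the source) can often be caught with simple heuristics before training. The following Python sketch is an illustration only, not PangeaMT's actual cleaning pipeline. The function name `flag_segment_pair`, the length-ratio threshold and the symbol list are all assumptions.

```python
# Hypothetical heuristics for flagging noisy source/target pairs.
# The threshold and symbol list are illustrative assumptions, not values from the talk.
MAX_LENGTH_RATIO = 2.0      # assumed limit on longer/shorter character length
ONE_SIDED_SYMBOLS = "€$£%"  # symbols expected on both sides or on neither


def flag_segment_pair(source: str, target: str) -> list[str]:
    """Return the reasons why a source/target pair looks unreliable."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return ["empty segment"]
    reasons = []
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    if longer / shorter > MAX_LENGTH_RATIO:
        reasons.append("length mismatch")
    if "\ufffd" in src + tgt:
        reasons.append("replacement character")
    if any((ch in src) != (ch in tgt) for ch in ONE_SIDED_SYMBOLS):
        reasons.append("symbol on one side only")
    return reasons


# The truncated German segment from the examples above:
print(flag_segment_pair("On 22nd May we decided not to join the group.", "Am 22."))
# prints ['length mismatch']
```

A character-length ratio is a crude signal and behaves differently across language pairs (for instance Japanese to English), so any threshold would need tuning per pair.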

Page 13:

Domain        Translation    MT+PE
Automotive    400 wph        900 wph
Marketing     250 wph        450 wph
Software      350 wph        1,000 wph
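The speed-up implied by these figures is the MT+PE throughput divided by the translation throughput. A short Python sketch using only the numbers on the slide, and assuming that "Translation" means translating without MT (the dictionary layout and names are illustrative):

```python
# Words per hour from the slide: (translation without MT, MT + post-editing).
throughput_wph = {
    "Automotive": (400, 900),
    "Marketing": (250, 450),
    "Software": (350, 1000),
}

# Speed-up factor per domain: MT+PE words per hour over translation words per hour.
speedups = {
    domain: mt_pe / translation
    for domain, (translation, mt_pe) in throughput_wph.items()
}

for domain, factor in speedups.items():
    print(f"{domain}: {factor:.2f}x")
# prints:
# Automotive: 2.25x
# Marketing: 1.80x
# Software: 2.86x
```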


Page 18:

Domains are managed at TM and at engine level

Page 19:

I created this engine with medical and pharma TMX files and added environmental TMs to boost coverage. The client deals with plant-based natural drugs / ayurveda.

Tag-based TM selection
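Tag-based selection of this kind can be pictured as filtering a catalogue of TMs by domain tag before an engine is trained. The sketch below is purely illustrative: the `TranslationMemory` class, the file names and the tag vocabulary are hypothetical and do not describe PangeaMT's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationMemory:
    """Hypothetical record for one TMX file and its domain tags."""
    filename: str
    tags: set = field(default_factory=set)


def select_tms(catalogue, wanted_tags):
    """Return the TMs that carry at least one of the wanted domain tags."""
    wanted = set(wanted_tags)
    return [tm for tm in catalogue if tm.tags & wanted]


# Invented catalogue, loosely mirroring the engine described above.
catalogue = [
    TranslationMemory("medical.tmx", {"medical"}),
    TranslationMemory("pharma.tmx", {"pharma", "medical"}),
    TranslationMemory("environment.tmx", {"environmental"}),
    TranslationMemory("automotive.tmx", {"automotive"}),
]

selected = select_tms(catalogue, ["medical", "pharma", "environmental"])
print([tm.filename for tm in selected])
# prints ['medical.tmx', 'pharma.tmx', 'environment.tmx']
```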

Page 24:

[Timeline figure, 2010 to 2018. Axis label: User empowerment]

• MT acceptance growth (still)

• Translator engagement challenge (being solved particularly with in-house translators & economic climate)

• Need for data is being addressed – still more work to be done.

• The difference will be made by data handling and MT techniques (hybrid, combination, syntax, re-ordering, etc.)

• Users and practitioners now can build their own systems, A TREND BEING FOLLOWED BY OTHER PLAYERS.

Until 2011/12

YEAR 2016

000's of customized MT systems

In 5 years... after 2017… where?

Tech. not the realm of a few providers

Ubiquitous MT

2009