gala webminar september 2013

PangeaMT: User-Empowering, Data-Driven, In-Domain Machine Translation. Manuel Herranz – Elia Yuste – Alex Helle – Andi Frank. #pangeanic E: [email protected]

Upload: pangeanic

Post on 03-Dec-2014


Category:

Technology



DESCRIPTION

Pangea Machine Translation platform from Pangeanic. A product presentation by Manuel Herranz, Elia Yuste and Andi Frank showcasing the best of automated cleaning cycles, automated engine retraining and machine translation engine creation.

TRANSCRIPT

Page 1: Gala Webminar September 2013

PangeaMT Manuel Herranz – Elia Yuste – Alex Helle – Andi Frank

User-Empowering, Data-Driven, In-Domain Machine Translation

#pangeanic E: [email protected]

Page 2: Gala Webminar September 2013

AGENDA
• Industry reflections
• Pangeanic PangeaMT
• Customization as Key Initial Servicing Step of our MT Offering
• All about the PangeaMT Platform
– Featuring Highlights and Demo
– API: CAT Environment Integration (Demo)
• Q&A Round
• GALA Marketplace Offer

Page 3: Gala Webminar September 2013

[Chart: COST OF TRANSLATION (price/w) vs DEMAND, in 10-year steps (1995, 2005, 2015)]

• Price per word a valid model?

• Is there an explanation?

• What can we do about it? Is there a future for the Language Industry?

• Unique to this industry?

Page 4: Gala Webminar September 2013

MASSIVE AMOUNTS OF DATA – IS LANGUAGE BUSINESS MANAGEABLE?

[Chart: world's data in TB / EB vs typical translation volume, 1990 to 2015]

Page 5: Gala Webminar September 2013

Why Machine Translation?

As of May 2009: 487 billion gigabytes, or 1,000,000,000 × 487,000,000,000 = 4.87 × 10^20 bytes

Estimates: up 50% a year (Oracle); doubles every 11 hours (IBM)

Humankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986 (ComputerWorld, 2011)

Researchers at the University of California, Berkeley, found that the amount of data generated from the dawn of time through 2002 was about 5 exabytes.
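The unit arithmetic behind the May 2009 figure, and what "up 50% a year" means as a doubling time, can be checked with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, and the helper name `doubling_time_years` is made up for illustration.

```python
import math

GB = 10**9    # bytes in one gigabyte (decimal, SI)
EB = 10**18   # bytes in one exabyte

# 487 billion gigabytes, as quoted for May 2009
stored_2009_bytes = 487 * 10**9 * GB
stored_2009_eb = stored_2009_bytes // EB


def doubling_time_years(annual_growth):
    """Years needed to double at a constant annual growth rate (0.5 = 50%)."""
    return math.log(2) / math.log(1 + annual_growth)


print(stored_2009_bytes)                    # 487 followed by 18 zeros = 4.87 x 10^20
print(stored_2009_eb)                       # 487 exabytes
print(round(doubling_time_years(0.5), 2))   # about 1.71 years
```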

Page 6: Gala Webminar September 2013

Why Machine Translation?
The Data Deluge

As Content Volume Explodes, Machine Translation Becomes an Inevitable Part of Global Content Strategy http://ow.ly/jVuhZ

In 2011, it took about two days for the world to create the same 5 exabytes of data that it took human eons to generate.

In 2013, it took the world just 10 minutes to create 5 exabytes.

Eric Schmidt: Every 2 Days We Create As Much Information As We Did Up To 2003 (TechCrunch, 2010)

The sixth power of 1,000 = 10^18

1 EB = 1,000,000,000,000,000,000 B = 10^18 bytes = 1,000 petabytes = 1 billion gigabytes.
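To make the two "5 exabytes" figures comparable, they can be turned into daily rates. The sketch below takes the numbers as quoted on the slide (5 EB in two days in 2011, 5 EB in 10 minutes in 2013) and uses exact fractions; the function name `eb_per_day` is an illustrative assumption.

```python
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60

# Sanity check of the unit identities quoted above.
assert 1000**6 == 10**18 == 1000 * 10**15 == 10**9 * 10**9


def eb_per_day(exabytes, minutes):
    """Creation rate in exabytes per day, given a volume and the minutes it took."""
    return Fraction(exabytes * MINUTES_PER_DAY, minutes)


rate_2011 = eb_per_day(5, 2 * MINUTES_PER_DAY)  # 5 EB in two days
rate_2013 = eb_per_day(5, 10)                   # 5 EB in ten minutes
speedup = rate_2013 / rate_2011

print(rate_2011, rate_2013, speedup)  # 5/2 720 288
```

On these figures the creation rate grew by a factor of 288 between 2011 and 2013.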

Page 7: Gala Webminar September 2013

Where is data stored?

Page 8: Gala Webminar September 2013

What can I do with MT?
Machine Translation application, NEW usage and success depend on the content type (sports, politics, social, etc.) and the output format.

MT for assimilation: "gisting" or "understanding"
• Practically unlimited demand; but free web-based services reduce incentive to improve technology
• Coverage more important. Instant quality

MT for dissemination: "publication"
• Publishable quality that can only be achieved by humans. MT & tools a productivity booster

MT for direct communication
• Current R&D, military uses; systems for spoken MT, first applications for smartphones, online help, multilingual chat systems

Page 9: Gala Webminar September 2013


Short history
• Pangeanic: LSP. Major clients in Asia, European localization, increasing number of languages
• Need to produce translation faster, cheaper…
• Experimenting with some RB MT systems
• TAUS & TDA founding members
• Partnering with Valencia's Computer Science Institute & Prof. F. Casacuberta / E. Vidal Research Team
• Commercial implementations of PangeaMT systems at client side: SONY EUROPE, SYBASE, LSPs…

Page 10: Gala Webminar September 2013


Milestones
• EU Post-editing contract 2007 (RBMT output)
• Euromatrix mention
• AMTA 2010
• AAMT 2011/12 (JP Hybridization and MT DIY)
• 1st commercial platform 2010
• DIY 2011 (automated re-training cycles)
• SaaS Power, LocWorld Paris 2012
• Improved automated cleaning cycles, online automated training
• Regional EU R&D Funds ("Feder" x 3: 2009-2011) & Marie Curie EXPERT Project

Page 11: Gala Webminar September 2013

Customization by the PangeaMT Team

Key to achieving better qualitative results later
• Top-notch human and automated service
• Focused on the Client from day one!
• Prior to 1st-time Engine Delivery, prior to Platform Deployment (production)

• Customization concentrates on data and best engine consultancy
• Data cleaning and enhancement
• The impact of glossaries (in-domain, client-/product-specific…)
• Reporting (your data was like this… now let's do this)
• Training

Pangeanic tests all the development features in-house at a TRANSLATION DEPARTMENT BEFORE RELEASE.

Page 12: Gala Webminar September 2013

Getting the data right: Automated cleaning and preparation

• TMX data: bilingual XML with inline tags/markup
• Cleanup: Entities (XML entities like &copy; etc.)
• Cleanup: Characters (invalid characters)
• Cleanup: Tags (remove: <ph> etc.)
• Conversion: two plain text files
• Moses Cleanup: Segments (empty lines, sentence ratio wrong)
• Tokenization (example: "By default," → "By default ,")
• Lower-casing (example: "HOUSE" → "house")
• Result: two aligned text files, no tags, lower-cased
• MT engine training
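The preparation steps on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not PangeaMT's code: the function names, the regex tokenizer (a naive stand-in for the Moses tokenizer) and the length-ratio threshold are all assumptions made for the example.

```python
import html
import re

# Inline TMX codes whose content is formatting, not translatable text.
INLINE_CODE_RE = re.compile(r"<(bpt|ept|ph|it)\b[^>]*>.*?</\1>", re.S)
ANY_TAG_RE = re.compile(r"<[^>]+>")
# Control characters plus the Unicode replacement character.
INVALID_CHAR_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def clean_markup(segment):
    """Cleanup of tags, entities and invalid characters.

    Tags are removed before entities are decoded so that escaped text
    such as &lt;b&gt; is not mistaken for markup.
    """
    segment = INLINE_CODE_RE.sub(" ", segment)
    segment = ANY_TAG_RE.sub(" ", segment)
    segment = html.unescape(segment)
    segment = INVALID_CHAR_RE.sub("", segment)
    return " ".join(segment.split())


def tokenize(segment):
    """Naive tokenizer: split punctuation from words ("By default," -> "By default ,")."""
    return " ".join(TOKEN_RE.findall(segment))


def keep_pair(src, tgt, max_ratio=3.0):
    """Segment cleanup: drop empty lines and pairs with a wrong sentence ratio."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def prepare(pairs):
    """Turn raw (source, target) segments into aligned, tokenized, lower-cased pairs."""
    prepared = []
    for src, tgt in pairs:
        src, tgt = (tokenize(clean_markup(s)).lower() for s in (src, tgt))
        if keep_pair(src, tgt):
            prepared.append((src, tgt))
    return prepared


print(prepare([("By default, the HOUSE", "Por defecto, la CASA"), ("Hello", "")]))
```

The second pair is dropped because its target side is empty; the first comes out tokenized and lower-cased, ready to be written to the two aligned text files.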

Page 13: Gala Webminar September 2013

Don’t forget data cleaning!!!

<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg></tuv><tuv xml:lang="FR-FR"><seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg></tuv>

<tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid=“pat"><tuv xml:lang="EN-US"><seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25&quot;; width –<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg></tuv><tuv xml:lang="ES-EM"><seg><bpt i="1">{\f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1&quot;.<ept i="3">}</ept></seg></tuv></tu>

<tuv xml:lang=“EN-US"><seg>On 22nd May we decided not to join the group.</seg><tuv xml:lang=“DE-DE"><seg>Am 22. </seg>

More cleaning

Cleaning
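As an illustration of how segment pairs might be pulled out of TMX before cleaning, here is a hedged sketch using Python's standard `xml.etree.ElementTree`. The function names and the sample data are assumptions for the example. A strict XML parser will also reject malformed input, such as an unclosed `<tu>` or curly quotes around attribute values, so files like those would need repair first.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Inline elements that carry formatting codes rather than translatable text.
INLINE_CODES = {"bpt", "ept", "it", "ph", "ut"}


def seg_text(elem):
    """Text of a <seg>, skipping the content of inline formatting codes."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in INLINE_CODES:
            parts.append(seg_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def read_tmx(xml_string, src_lang, tgt_lang):
    """Yield (source, target) pairs for the two language codes (case-insensitive)."""
    root = ET.fromstring(xml_string)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = " ".join(seg_text(seg).split())
        if src_lang.lower() in segs and tgt_lang.lower() in segs:
            yield segs[src_lang.lower()], segs[tgt_lang.lower()]


# Hypothetical, well-formed sample modelled on the slide's second example.
SAMPLE_TMX = r"""<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="EN-US"><seg>Overall height <bpt i="1">{\f43 </bpt>-<ept i="1">}</ept> 25&quot;</seg></tuv>
  <tuv xml:lang="ES-ES"><seg>Altura total - 25&quot;</seg></tuv>
</tu>
</body></tmx>"""

pairs = list(read_tmx(SAMPLE_TMX, "en-us", "es-es"))
print(pairs)
```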

Page 14: Gala Webminar September 2013

Don’t forget data cleaning!!!

<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>The President of the United States visited Costa Rica.</seg></tuv><tuv xml:lang=“ES-ES"><seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg></tuv>

<tuv xml:lang=“JP"><seg>同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。 </seg><tuv xml:lang=“EN-US"><seg>It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today'stranslation industry. </seg>


Page 15: Gala Webminar September 2013

DATA CLEANING CYCLE (AUTOMATED)

Engine training with clean data
Having approved, terminologically sound, clean data improves engine accuracy and performance with even small sets of data.

Data cleaning modules
• Remove any "suspects":
• Sentences that are too long
• Mismatches (of many kinds!)
• Terminological inaccuracies
• Non-useful segments, etc.

Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools or document alignments, crawling.

Data Cleaning (in-lines)
Remove all non-translation data.

TMX Human approval
Some of this material may actually be OK for training. It is then input in the training set.
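A "suspect" filter of the kind described in the cleaning cycle could look something like the sketch below. It is an assumption-laden illustration rather than the PangeaMT implementation: the checks, thresholds, glossary format and function names are all invented for the example. Flagged pairs are set aside for human approval instead of being discarded.

```python
import re
from dataclasses import dataclass, field

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


@dataclass
class Verdict:
    keep: bool
    reasons: list = field(default_factory=list)


def numbers(text):
    """Numbers in a segment, digits only, so "20.1" and "20,1" compare equal."""
    return sorted(re.sub(r"\D", "", n) for n in NUMBER_RE.findall(text))


def check_pair(src, tgt, glossary=None, max_tokens=80, max_ratio=2.0):
    """Flag a segment pair as suspect; thresholds are illustrative guesses."""
    reasons = []
    s, t = src.split(), tgt.split()
    if not s or not t:
        reasons.append("empty segment")
    else:
        if max(len(s), len(t)) > max_tokens:
            reasons.append("too long")
        if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
            reasons.append("length mismatch")
    if numbers(src) != numbers(tgt):
        reasons.append("number mismatch")
    for term, expected in (glossary or {}).items():
        if term.lower() in src.lower() and expected.lower() not in tgt.lower():
            reasons.append(f"terminology: {term!r} without {expected!r}")
    return Verdict(keep=not reasons, reasons=reasons)


def split_for_review(pairs, **options):
    """Separate clean pairs from suspects that go to human approval."""
    clean, suspects = [], []
    for src, tgt in pairs:
        verdict = check_pair(src, tgt, **options)
        if verdict.keep:
            clean.append((src, tgt))
        else:
            suspects.append((src, tgt, verdict.reasons))
    return clean, suspects


verdict = check_pair(
    "The President of the United States visited Costa Rica.",
    "El Presidente de los Estados Unidos, el señor Obama y su esposa "
    "la señora Michelle, visitaron Costa Rica el pasado sábado.",
)
print(verdict.reasons)  # ['length mismatch']
```

With these thresholds, the over-translated Costa Rica example from the earlier slide is flagged for its length ratio (9 source tokens against 21 target tokens).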

Page 16: Gala Webminar September 2013

A Success Story

Sony Professional Europe, Salomé Lopez-Lavado
- Needs: improve publication in French, Italian, Spanish
- 8M words training set
- Time-to-market: from 3 days down to 1.5 days (html, InDesign)
- Outsourcing cost: -20%
- Volume: 1.5M words/year

Japanese automotive manufacturer
- Spanish
- 8M words/year
- Time to market reduced by 2 to 3 weeks, from 8 to 6 or 5 weeks
- Team of 17 freelancers down to 4-7 post-editors
- Outsourcing cost: -30%

Spanish LSP working for banking sector
- Spanish
- 1-2M words/year
- Time to market: 1 week to 2 days!!!!
- Docx, html, tmx
- Down from 2-3 in-house staff and 2-3 freelancers to 2 in-house!!!

http://ow.ly/peuFD

Successfully applied (3rd-party applications/beneficiaries)

Page 17: Gala Webminar September 2013

Use Case -

✔Even with small data sets!!

Page 18: Gala Webminar September 2013

Potential Uses of Machine Translation

• PangeaMT can be self-hosted when data security is critical (all processes internal to the organization)
- commercially sensitive data
- financial, legal, institutional
- intelligence, knowledge-gathering
- product pre-release, etc.

• Control Panel + full system statistics

• Re-trainings and updates by the client for data privacy / more accuracy

Page 19: Gala Webminar September 2013

Other Potential Uses of Machine Translation

• Information discovery: patents, unknown documents

• Automatic, on-demand creation of foreign language versions / web apps – keyword testing

• Multilingual crawling, data discovery

• Pre-translation

Page 20: Gala Webminar September 2013


Polling Questions to Audience

Page 21: Gala Webminar September 2013


Platform overview
• 24/7 control over your data and engines
• Secure, robust and scalable
• User focused (permissions and empowering capabilities)
• API linked, if need be
• Enabled us to offer an extraordinarily flexible business model
- SaaS
- SaaS Power (online DIY, re-trainings included)
- Full Power (PLATFORM OWNERSHIP)

Page 22: Gala Webminar September 2013

PangeaMT System – Domain Creation

Page 23: Gala Webminar September 2013

PangeaMT System – Data Cleaning

Page 24: Gala Webminar September 2013

PangeaMT System – Engine Creation

Page 25: Gala Webminar September 2013

PangeaMT System – Engine Training

Page 26: Gala Webminar September 2013


PangeaMT API – SDL Plugin Demo Time (Video file)

Page 27: Gala Webminar September 2013

Myth: MT will never be as good as humans

“We cannot solve the problem using the same tools and the way of thinking that created it” A. Einstein

uhmmm, it is going to get really good...

1st stage: We are creating usable engines, first PE experiences. 2009-2015 or 2020

2nd stage: PE material and more data make engines even more predictable. More specialist engines

3rd stage: Beyond 2030... no predictions

Page 28: Gala Webminar September 2013

GALA Marketplace Offer

[email protected]

Free Consultancy and Custom Engine Piloting Period

October-November 2013

Page 29: Gala Webminar September 2013

Q&A

Thank you!!

[email protected]