TAUS Open Source Machine Translation Showcase, Paris, Manuel Herranz, Pangeanic, 4 June 2012
DESCRIPTION
This presentation is part of the MosesCore project, which encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates, follow us on Twitter - #MosesCore
TRANSCRIPT
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE
PangeaMT Putting Open Standards to work
16:00-16:15, Monday 4 June
Manuel Herranz, Pangeanic
PangeaMT – putting open standards to work… well
Manuel Herranz #manuelhrrnz #pangeanic E: [email protected]
MACHINE TRANSLATION
Make my day,
IS NOT
IS
become a post-editor
Chomsky: Imagine that if in the future large enough amounts of data existed, they could be processed by computers with enough computing power
rule-based systems, IBM licenses, many linked to patent EN/RU & Intel
First statistical papers
1st Open source SMT
Translation industry appropriating Moses
http://euromatrixplus.net/moses
DIY SMT
http://t.co/HDTboxQ
PEAK of Cold War and information control. Products & information directed to consumers / users / citizens
BEGINNING of data resources. Internet. Accessibility to information
Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world
Types of LSPs (Ben Sargent – TMS Inspiration Days April ‘11 – Krakow)
a) develop it for their use and for their clients (developers of a system),
b) buyers of systems (they do not want the headache of starting from scratch and prefer to buy ready-built solutions) and finally
c) there are those who prefer the mix-and-match approach (buying some good solutions outside and building interfaces and whatever they know works best for their business). The trend is towards unification
2007/08
2009/10
2011/12
• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, itd)
2007 and before
• RB tests with commercial software
• Insufficiently good output
• Only internal production
• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics (ES), then Fr/It/De in other fields
• Division born
• 00's of engine trials and language combinations
• Open-Source to commercial
• TMX / XLIFF workflows
As of May 2009: 487 billion gigabytes, or 1,000,000,000 * 487,000,000,000 = 4.87 x 10^20 bytes
Estimates
Up 50% a year (Oracle)
Doubles every 11 hours (IBM)
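The arithmetic on this slide can be checked with a short sketch. It assumes decimal gigabytes (10^9 bytes each) and, purely for illustration, applies the 50%-a-year estimate as compound growth; the name `projected_bytes` is hypothetical and not part of the talk.

```python
# Assumption: 1 gigabyte = 10**9 bytes (decimal, not 2**30).
BYTES_PER_GB = 1_000_000_000
GIGABYTES_MAY_2009 = 487_000_000_000  # 487 billion gigabytes, from the slide

total_bytes = GIGABYTES_MAY_2009 * BYTES_PER_GB


def projected_bytes(years: int, annual_growth: float = 0.5) -> float:
    """Compound the May 2009 figure by `annual_growth` per year (illustrative only)."""
    return total_bytes * (1 + annual_growth) ** years


print(f"{total_bytes:.2e} bytes")         # prints 4.87e+20 bytes
print(f"{projected_bytes(3):.2e} bytes")  # three years at +50% a year
```

The exact integer product confirms the order of magnitude: 487 followed by 18 zeros, that is 4.87 x 10^20 bytes.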
OBJECTIVES = CHALLENGES 2007 - 2010
Turn academic development (Moses) into a commercial application.
To provide high-quality MT for post-editing and save time and cost. No Google-type broad translation, but domain-specific, user-centric.
Lower the entry level for MT. Bring affordability and user control / empowerment to MT. Bring it to the user, take it away from the programmer.
How? By fostering open-standard geared translation automation strategies.
To use only community-based open standards (OASIS / ISO: XLIFF, TMX, XML). NO proprietary formats (technology independence), so USERS are not "locked in" to buying and updating expensive software.
DIY SMT June 2011 http://t.co/HDTboxQ
The rush for data
We soon realised that there was a rush to gather data, but that other resources around the data were also necessary
cleaning
More cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atm sphère.</seg>
</tuv>
<tuv xml:lang="EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang="DE-DE">
<seg>Am 22. </seg>
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang="ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
<tuv xml:lang="JP">
<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>
<tuv xml:lang="EN-US">
<seg>It is a journalistic point of view and strengths of the English-
language newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today'stranslation industry.
</seg>
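One kind of problem in the translation units above, a target that is far shorter or far longer than its source, can be caught with a simple length-ratio check. The following is a minimal sketch, not PangeaMT's actual cleaning pipeline: the function names and the 0.5 / 2.0 thresholds are assumptions chosen for illustration, and a character-length ratio would need different thresholds for pairs such as Japanese-English.

```python
def length_ratio(source: str, target: str) -> float:
    """Target/source character-length ratio; 0.0 if either side is empty."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return 0.0
    return len(tgt) / len(src)


def flag_suspect_pairs(pairs, low=0.5, high=2.0):
    """Return (source, target, ratio) for pairs whose ratio is outside [low, high].

    The thresholds are illustrative assumptions, not values from the talk.
    """
    suspects = []
    for source, target in pairs:
        ratio = length_ratio(source, target)
        if ratio < low or ratio > high:
            suspects.append((source, target, round(ratio, 2)))
    return suspects


# Segment pairs copied from the slides (including their defects).
SLIDE_PAIRS = [
    ("A system for recovering the methane that is emitted from the manure "
     "so that it does not leak into the atmosphere.",
     "Système permettant de r€ pérer le méthane qui se dégage de l'engrais "
     "naturel d'origine animale de sorte qu'il ne se dissipe pas dans "
     "l'atm sphère."),
    ("On 22nd May we decided not to join the group.", "Am 22. "),
    ("The President of the United States visited Costa Rica.",
     "El Presidente de los Estados Unidos, el señor Obama y su esposa la "
     "señora Michelle, visitaron Costa Rica el pasado sábado."),
]

suspects = flag_suspect_pairs(SLIDE_PAIRS)
for source, target, ratio in suspects:
    print(ratio, "|", source, "|", target.strip())
```

With these thresholds the truncated German segment and the over-long Spanish segment are flagged, while the French segment passes because its length is plausible. Its encoding damage needs a separate check, which is one reason a single cleaning pass is rarely enough.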
Domain       Translation   MT+PE
Automotive   400 wph       900 wph
Marketing    250 wph       450 wph
Software     350 wph       1,000 wph
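The throughput figures above (words per hour for translation from scratch versus MT plus post-editing) translate into speed-up factors and time savings as follows. This is a small sketch of the arithmetic only; the dictionary and function names are hypothetical.

```python
# Words per hour from the slide: (translation from scratch, MT + post-editing).
THROUGHPUT_WPH = {
    "Automotive": (400, 900),
    "Marketing": (250, 450),
    "Software": (350, 1000),
}


def speedup(human_wph: float, mtpe_wph: float) -> float:
    """How many times faster MT + post-editing is than translating from scratch."""
    return mtpe_wph / human_wph


def time_saved_fraction(human_wph: float, mtpe_wph: float) -> float:
    """Fraction of working time saved on a job of fixed word count."""
    return 1 - human_wph / mtpe_wph


for domain, (human, mtpe) in THROUGHPUT_WPH.items():
    print(f"{domain}: x{speedup(human, mtpe):.2f} faster, "
          f"{time_saved_fraction(human, mtpe):.0%} time saved")
```

On these figures the gain ranges from 1.8 times faster for marketing content to roughly 2.9 times faster for software content.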
Domains are managed at TM and at engine level
I created this engine with medical, pharma TMX and added environmental TMs to boost coverage - Client deals with plant-based natural drugs / ayurveda
Tag-based TM selection
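Tag-based TM selection can be pictured as filtering a pool of translation memories by domain tag before training an engine. The sketch below is an assumption about how such a selection might look, not PangeaMT's implementation; the class, the function, the file names and the tags are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationMemory:
    """A TM file together with the domain tags assigned to it (hypothetical model)."""
    name: str
    tags: frozenset


def select_tms(tms, wanted_tags):
    """Keep every TM that shares at least one tag with `wanted_tags`."""
    wanted = frozenset(wanted_tags)
    return [tm for tm in tms if tm.tags & wanted]


# Hypothetical TM pool; the file names are placeholders.
POOL = [
    TranslationMemory("medical.tmx", frozenset({"medical"})),
    TranslationMemory("pharma.tmx", frozenset({"pharma", "medical"})),
    TranslationMemory("environment.tmx", frozenset({"environmental"})),
    TranslationMemory("automotive.tmx", frozenset({"automotive"})),
]

# Mirrors the example above: medical and pharma data plus environmental TMs.
selected = select_tms(POOL, {"medical", "pharma", "environmental"})
print([tm.name for tm in selected])
```

Matching on any shared tag keeps the selection permissive, which suits the coverage-boosting use described above; a stricter engine could require all tags to match instead.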
[Timeline chart, 2010-2018: User empowerment]
• MT acceptance growth (still)
• Translator engagement challenge (being solved particularly with in-house translators & economic climate)
• Need for data is being addressed – still more work to be done.
• The difference will be made by data handling and MT techniques (hybrid, combination, syntax, re-ordering, etc)
• Users and practitioners now can build their own systems, A TREND BEING FOLLOWED BY OTHER PLAYERS.
Until 2011/12
YEAR 2016
000's of customized MT systems
In 5 years... after 2017… where?
Tech. not the realm of a few providers
Ubiquitous MT
2009