TAUS MACHINE TRANSLATION SHOWCASE Vancouver, Canada The Simplified Guide to Getting Started in SMT Wednesday, 29 October 2014 Tom Hoar, Precision Translation Tools The research within the project MosesCore leading to these results has received funding from the European Union 7th Framework Programme, grant agreement no 288487



The Simplified Guide to Getting Started in SMT

Professional tools, professional expertise

PTTools

• Software vendor - founded Feb 2010
  – Adobe : Photoshop
  – PTTools : DoMT
• DoMT brand
  – DoMT Desktop: organize and manage training corpora, models and custom workflows.
  – DoMT Server: automation solution
• Customer education

Who We Are

AGENDA

• Current State of SMT
• Getting Started
• Skill Requirements
• Use Cases
• Q&A

Current SMT

Current State

• Who has not heard of SMT?
• Requires powerful, expensive hardware
• Huge translation memories
• Complicated processes
• Dearth of skilled personnel

Then vs Now

                    2007                        2014
Hardware            50 CPUs in private cloud    One 24-CPU machine
Mega corpus         2 weeks                     36 hours
Cost                US $100K++                  US $1,500

                    1992                        2014
Computer            SGI @ $100K                 Dell @ $5,000
Software            Eclipse Alias @ $25K        Adobe CS Cloud $1,500
Graphic Production  $300 per hour               $30++ per hour

Business Models

• Where is the work done?
• Who does the work?
• Outsourced
  – Free
  – For Fee
• Insourced
  – Enterprise Server
  – Desktop Application

Reality 2014

• Inexpensive capable hardware exists
• Translation memories within reach
• Processes migrating to software
• Training available for existing personnel


“Simple Guide”

Is Academic Moses Enough?

“There are considerable amounts of additional functionality... that are not included in Moses that are essential in order to offer a strong and innovative commercial MT platform.”
– Philipp Koehn, Professor, University of Edinburgh

(http://kv-emptypages.blogspot.com/2013/09/understanding-mt-customization.html)

Getting Started

• Manage Corpora
• Manage SMT Models
• Produce MT
• Post Edit Results

Manage Corpora

• Acquire
  – Translation memory archives
  – Public corpora
  – Convert docs
  – Recycle post-edited MT
• Process
  – Transform/filter
  – Curate/categorize
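The "Transform/filter" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_pairs`, the 80-token limit and the 3:1 length ratio are hypothetical choices, not settings from the presentation or from any particular tool.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def filter_pairs(pairs: Iterable[Pair],
                 max_tokens: int = 80,
                 max_ratio: float = 3.0) -> Iterator[Pair]:
    """Yield cleaned segment pairs, dropping likely noise.

    The thresholds are illustrative assumptions, not recommended values.
    """
    seen = set()
    for source, target in pairs:
        # Transform: collapse runs of whitespace on both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())
        src_len, tgt_len = len(source.split()), len(target.split())
        if src_len == 0 or tgt_len == 0:
            continue  # one side is empty
        if src_len > max_tokens or tgt_len > max_tokens:
            continue  # overly long segment
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # lengths too different, probably misaligned
        if (source, target) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((source, target))
        yield source, target


sample = [
    ("Save the file.", "Guarde el archivo."),
    ("Save the file.", "Guarde  el archivo."),  # duplicate once whitespace is normalized
    ("OK", ""),                                  # empty target
    ("Cancel", "Haga clic en el botón para cancelar la operación actual"),
]
cleaned = list(filter_pairs(sample))
```

Only the first pair survives here. In practice the thresholds would be chosen per language pair and content type.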

Manage SMT Models

• Train translation models
• Train language model
• Tune SMT model
• Evaluate SMT model
• Deploy SMT engine
• Versioning
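Training, tuning and evaluation each need their own data, so a corpus is usually divided before any model is built. The sketch below shows one way to do that; the name `split_corpus`, the set sizes and the fixed seed are assumptions made for the example, not part of the presentation.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def split_corpus(pairs: Sequence[Pair],
                 tune_size: int,
                 test_size: int,
                 seed: int = 13) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Return (train, tune, test) with no pair shared between the sets."""
    if tune_size + test_size >= len(pairs):
        raise ValueError("corpus too small for the requested held-out sets")
    shuffled = list(pairs)
    # A fixed seed keeps the split reproducible between model versions.
    random.Random(seed).shuffle(shuffled)
    tune = shuffled[:tune_size]
    test = shuffled[tune_size:tune_size + test_size]
    train = shuffled[tune_size + test_size:]
    return train, tune, test


corpus = [(f"source {i}", f"target {i}") for i in range(100)]
train, tune, test = split_corpus(corpus, tune_size=10, test_size=10)
```

Keeping the tuning and test sets out of the training data is what makes the evaluation score meaningful. Storing the seed with the model supports the "Versioning" item above.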

Produce MT

• Manual
  – Import/export TMX
  – Import/export XLIFF
  – Doc-to-doc support
• Automation
  – TMS Integration
  – CAT Integration
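As an illustration of the TMX import step, the following sketch reads segment pairs from a TMX document using Python's standard library. The function name `read_tmx` and the sample document are hypothetical. The sketch assumes plain two-letter language codes and does not validate the file against the TMX specification.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(tmx_text: str, source_lang: str, target_lang: str) -> List[Tuple[str, str]]:
    """Extract (source, target) pairs from the translation units of a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments: Dict[str, str] = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also returns text held inside inline elements.
                segments[lang] = "".join(seg.itertext()).strip()
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="es"><seg>Abra el archivo.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Untranslated segment</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx(SAMPLE_TMX, "en", "es")
```

Units that lack one of the two languages are skipped. Regional codes such as "en-US" would need extra matching logic that this sketch leaves out.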

Post-edit Results

• Subject of other presentations
• Recycle as new corpus?


Human Resources

SMT Specialists

• Computational linguists are scientists who specialize in language and computing to create and advance the science.
• Specialists are localization engineers who review the data and select tools to prepare a training corpus that minimizes post-editing in commercial production.

Specialist’s Required Skills

• Organization skills (e.g. manage TMs)
• Observant of patterns
• Willingness to learn
• Regular expressions – helpful
• Programming skills – unnecessary
• Computational linguists – unnecessary
• System Administrator – unnecessary

Observant of Patterns

• Technical patterns
• Linguistic patterns

Observant of Patterns

<ut>{\cs6\f1\cf6\lang1024 </ut>&lt;span class=&quot;small-text&quot;&gt;<ut>} </ut>Copyright © 1997-2009 &amp;nbsp;\ n \ n

• Archived TMX content
  – RTF
  – HTML & XML-escaped HTML
  – XML
  – Broken programmer’s markup
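Once such patterns are spotted, a few regular expressions can remove them. The sketch below cleans the sample segment shown above. The function name `clean_segment` and the specific rules are assumptions made for the illustration, and they are heuristics: a real archive would need rules derived from its own content.

```python
import html
import re

UT_ELEMENT = re.compile(r"<ut>.*?</ut>", re.DOTALL)  # TMX <ut> elements holding RTF codes
HTML_TAG = re.compile(r"<[^>]+>")                     # HTML/XML tags left after unescaping
BROKEN_NEWLINE = re.compile(r"\\\s*n\b")              # literal "\n" or "\ n" left in the text


def clean_segment(raw: str) -> str:
    """Strip formatting residue from one raw TMX segment (heuristic sketch)."""
    text = UT_ELEMENT.sub(" ", raw)
    text = html.unescape(text)       # &lt;span ...&gt; becomes <span ...>
    text = HTML_TAG.sub(" ", text)
    text = html.unescape(text)       # second pass for double-escaped entities such as &amp;nbsp;
    text = BROKEN_NEWLINE.sub(" ", text)
    return " ".join(text.split())    # collapse whitespace, including non-breaking spaces


RAW_SEGMENT = (
    r"<ut>{\cs6\f1\cf6\lang1024 </ut>&lt;span class=&quot;small-text&quot;&gt;"
    r"<ut>} </ut>Copyright © 1997-2009 &amp;nbsp;\ n \ n"
)
cleaned_segment = clean_segment(RAW_SEGMENT)
```

The order of the steps matters: the escaped HTML only becomes recognizable as tags after the first unescape pass. This is the kind of work where regular expressions are helpful and full programming skills are not required.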


Use Cases

Use Cases

• Large LSP
  – Extensive MT experience
  – CSA Top 10
• 2 Medium LSPs
  – Post-editing experience
  – In-house localization engineers
• Freelance Translator
  – United Nations contractor
  – Technically savvy

Welocalize

• Work: Software localization
• Hardware: Virtual machines for pilot
• SMT models: EN-ES, EN-DE, EN-ZH, EN-RU
• Corpus: All corpora < 500,000 segment pairs
• Training: 3-month pilot
• Results: “Approached outsourcing vendors”
  – Zero-edit measure: 25-45%
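The slides do not define the zero-edit measure. The sketch below assumes it means the percentage of machine-translated segments that post-editors left unchanged. The function name `zero_edit_rate` and the sample data are hypothetical.

```python
from typing import Sequence


def zero_edit_rate(mt_segments: Sequence[str], post_edited: Sequence[str]) -> float:
    """Percentage of MT segments accepted without any edit (assumed definition)."""
    if len(mt_segments) != len(post_edited):
        raise ValueError("both lists must contain the same number of segments")
    if not mt_segments:
        return 0.0
    unchanged = sum(
        1
        for mt, pe in zip(mt_segments, post_edited)
        # Ignore differences that are only whitespace.
        if " ".join(mt.split()) == " ".join(pe.split())
    )
    return 100.0 * unchanged / len(mt_segments)


mt_output = [
    "Guarde el archivo.",
    "Abra el menú Archivo.",
    "Haga clic en Aceptar.",
    "Cierre la ventana.",
]
after_editing = [
    "Guarde el archivo.",
    "Abra el menú de archivos.",
    "Haga clic en Aceptar para continuar.",
    "Cierre esta ventana.",
]
rate = zero_edit_rate(mt_output, after_editing)
```

With one of four segments unchanged, the rate is 25.0. The made-up sample is far too small to say anything about a real engine.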

EQHO Communications

• Work: Software localization
• Hardware: $1,500 new 6-core computer
• SMT model: EN <-> European language
• Corpus: ~130,000 segment pairs
• Training: 3-month pilot
• Results: BLEU scores of 80 to 85
  – Zero-edit measure: 23-43%

Mid-sized European LSP

• Work: Financial and regulatory reports
• SMT model: EN <-> European language
• Corpus: ~800,000 segment pairs (25 years)
• Training: 20 hours of tutorials over 2 months
• Homework: Categorize TMs for 4+ months
• Results: BLEU scores rose from the low 50s to the mid-80s
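For readers unfamiliar with BLEU, the sketch below computes a simplified corpus-level score on a 0 to 100 scale. It assumes one reference per segment, whitespace tokenization, n-grams up to length 4 and no smoothing. The name `corpus_bleu` is chosen for this example, and the code is not the scoring tool used in the use cases, so its numbers are not comparable with the results reported here.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: Sequence[str], references: Sequence[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU: clipped n-gram precision with a brevity penalty."""
    if len(hypotheses) != len(references):
        raise ValueError("need exactly one reference per hypothesis")
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each n-gram count by how often it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


reference = ["the cat sat on the mat"]
perfect = corpus_bleu(["the cat sat on the mat"], reference)
partial = corpus_bleu(["the cat sat on the mat today"], reference)
```

An output identical to the reference scores 100, and the extra word in the second hypothesis lowers the score. Real evaluations use a held-out test set of many segments and a standard tokenizer.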

Freelance Translator

• Work: United Nations environmental reports
• Hardware: $1,500 new 6-core computer
• SMT model: EN <-> European language
• Corpus: ~250,000 segment pairs (25 years)
• Training: 40 hours of tutorials over 2 months
• Results: BLEU scores of 75 to 85
  – Zero-edit measure: averaged 35%

Conclusion

• Regardless of business model
  – Manage Corpora
  – Generate Models
  – Produce MT
  – Publish Results
• Re-purpose existing staff with training
• Rightsourcing
