machine translation and the translator - termcoord · machine translation and the translator...
TRANSCRIPT
Machine Translationand the Translator
Philipp Koehn
8 April 2015
Philipp Koehn Machine Translation 8 April 2015
1About me
• Professor at Johns Hopkins University (US), University of Edinburgh (Scotland)
• Author of textbook on statistical machine translation
• Leading development of open source Moses toolkit
– developed since 2006
– reference implementationof state-of-the art methods
– used in academiaas benchmark and testbed
– extensive commercial deployment(20% of MT market)
Philipp Koehn Machine Translation 8 April 2015
2Recent Projects
• Speech translation
• Computer aided translation
Development of an open source toolkit tightly integrated with machine translation
Novel types of assistance for translators Adaptation of machine translation to user needs
• Open source infrastructure
MOSES CORE
Philipp Koehn Machine Translation 8 April 2015
6Quality
HTER assessment
0%publishable
10%editable
20%
30% gistable
40% triagable
50%
(scale developed in preparation of DARPA GALE programme)
Philipp Koehn Machine Translation 8 April 2015
7Applications
HTER assessment application examples
0% Seamless bridging of language dividepublishable Automatic publication of official announcements
10%editable Increased productivity of human translators
20% Access to official publicationsMulti-lingual communication (chat, social networks)
30% gistable Information gatheringTrend spotting
40% triagable Identifying relevant documents
50%
Philipp Koehn Machine Translation 8 April 2015
8Current State of the Art
HTER assessment language pairs and domains
0%publishable French-English restricted domain
10% French-English news storieseditable German-English news stories
20% Chinese-English news storiesEnglish-Czech open domain
30% gistable English-Japanese open domain
40% triagable
50%
(informal rough estimates by presenter)
Philipp Koehn Machine Translation 8 April 2015
10A Clear Plan
Source Target
Lexical Transfer
Interlingua
Philipp Koehn Machine Translation 8 April 2015
11A Clear Plan
Source Target
Lexical Transfer
Syntactic Transfer
InterlinguaAna
lysis
Generation
Philipp Koehn Machine Translation 8 April 2015
12A Clear Plan
Source Target
Lexical Transfer
Syntactic Transfer
Semantic Transfer
Interlingua
Analys
isGeneration
Philipp Koehn Machine Translation 8 April 2015
13A Clear Plan
Source Target
Lexical Transfer
Syntactic Transfer
Semantic Transfer
Interlingua
Analys
isGeneration
Philipp Koehn Machine Translation 8 April 2015
14Learning from Data
statistical analysis statistical analysis
foreign/Englishparallel text
Englishtext
TranslationModel
LanguageModel
Decoding Algorithm
Philipp Koehn Machine Translation 8 April 2015
15Finding the Best Translation
eBEST = argmaxe p(e|f)
Philipp Koehn Machine Translation 8 April 2015
17Word Translation Problems
• Words are ambiguous
He deposited money in a bank accountwith a high interest rate.
Sitting on the bank of the Mississippi,a passing ship piqued his interest.
• How do we find the right meaning, and thus translation?
• Context should be helpful
Philipp Koehn Machine Translation 8 April 2015
18Phrase Translation Problems
• Idiomatic phrases are not compositional
It’s raining cats and dogs.
Es schuttet aus Eimern.(it pours from buckets.)
• How can we translate such larger units?
Philipp Koehn Machine Translation 8 April 2015
19Syntactic Translation Problems
• Languages have different sentence structure
das behaupten sie wenigstensthis claim they at leastthe she
19Syntactic Translation Problems
• Languages have different sentence structure
das behaupten sie wenigstensthis claim they at leastthe she
• Convert from object-verb-subject (OVS) to subject-verb-object (SVO)
• Ambiguities can be resolved through syntactic analysis
– the meaning the of das not possible (not a noun phrase)– the meaning she of sie not possible (subject-verb agreement)
Philipp Koehn Machine Translation 8 April 2015
20Semantic Translation Problems
• Pronominal anaphora
I saw the movie and it is good.
• How to translate it into German (or French)?
20Semantic Translation Problems
• Pronominal anaphora
I saw the movie and it is good.
• How to translate it into German (or French)?
– it refers to movie– movie translates to Film– Film has masculine gender– ergo: it must be translated into masculine pronoun er
• We are not handling this very well[Le Nagard and Koehn, 2010]
20Semantic Translation Problems
• Pronominal anaphora
I saw the movie and it is good.
• How to translate it into German (or French)?
– it refers to movie– movie translates to Film– Film has masculine gender– ergo: it must be translated into masculine pronoun er
• We are not handling this very well[Le Nagard and Koehn, 2010]
Philipp Koehn Machine Translation 8 April 2015
21Semantic Translation Problems
• Coreference
Whenever I visit my uncle and his daughters,I can’t decide who is my favorite cousin.
• How to translate cousin into German? Male or female?
• Complex inference required
Philipp Koehn Machine Translation 8 April 2015
22No Single Right Answer
Israeli officials are responsible for airport security.Israel is in charge of the security at this airport.The security work for this airport is the responsibility of the Israel government.Israeli side was in charge of the security of this airport.Israel is responsible for the airport’s security.Israel is responsible for safety work at this airport.Israel presides over the security of the airport.Israel took charge of the airport security.The safety of this airport is taken charge of by Israel.This airport’s security is the responsibility of the Israeli security officials.
Philipp Koehn Machine Translation 8 April 2015
23Learning from Data
• What is the best translation?
Sicherheit→ security 14,516Sicherheit→ safety 10,015Sicherheit→ certainty 334
Philipp Koehn Machine Translation 8 April 2015
24Learning from Data
• What is the best translation?
Sicherheit→ security 14,516Sicherheit→ safety 10,015Sicherheit→ certainty 334
• Counts in European Parliament corpus
Philipp Koehn Machine Translation 8 April 2015
25Learning from Data
• What is the best translation?
Sicherheit→ security 14,516Sicherheit→ safety 10,015Sicherheit→ certainty 334
• Phrasal rulesSicherheitspolitik→ security policy 1580
Sicherheitspolitik→ safety policy 13Sicherheitspolitik→ certainty policy 0
Lebensmittelsicherheit→ food security 51Lebensmittelsicherheit→ food safety 1084Lebensmittelsicherheit→ food certainty 0
Rechtssicherheit→ legal security 156Rechtssicherheit→ legal safety 5
Rechtssicherheit→ legal certainty 723
Philipp Koehn Machine Translation 8 April 2015
27Phrase-Based Model
natürlich hat John Spaß am Spiel
of course John has fun with the game
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
• Workhorse of today’s statistical machine translation
Philipp Koehn Machine Translation 8 April 2015
28Synchronous Grammar Rules
• Nonterminal rules
NP→ DET1 NN2 JJ3 | DET1 JJ3 NN2
• Terminal rules
N→maison | house
NP→ la maison bleue | the blue house
• Mixed rules
NP→ la maison JJ1 | the JJ1 house
Philipp Koehn Machine Translation 8 April 2015
29Learning Rules
I shall be passing on to you some comments
PRP MD VB VBG RP TO PRP DT NNS
NPPP
VP
VP
VP
S
Ich werde Ihnen die entsprechenden Anmerkungen aushändigen
Extracted rule: VP→ X1 X2 aushandigen | passing on PP1 NP2
Philipp Koehn Machine Translation 8 April 2015
30Syntax Decoding
SiePPER
willVAFIN
eineART
TasseNN
KaffeeNN
trinkenVVINF
NP
VPS
PROshe
VBdrink
NN|
cup
IN|
of
NP
PP
NN
NP
DET|a
VBZ|
wantsVB
VPVP
NPTO|
to
NNcoffee
S
PRO VP
➏
➊ ➋ ➌
➍
➎
Philipp Koehn Machine Translation 8 April 2015
31New State of the Art
• Good results for German–English [WMT 2014]
language pair syntax preferredGerman–English 57%English–German 55%
• Mixed for other language pairs
language pair syntax preferredCzech–English 44%Russian–English 44%Hindi–English 54%
• Also very successful for Chinese–English
Philipp Koehn Machine Translation 8 April 2015
rank
frequency
33Sparse Data
• Statistical estimation often suffers from sparse data
• Zipf’s law
– most words are extremely rare
– frequency× rank = constant
• Statistics from Europarl
– the occurs 1,929,379 times
– large tail of words that occur once:33,447 words, for instancecornflakes, mathematicians, or Tazhikhistan
Philipp Koehn Machine Translation 8 April 2015
34Brown Clusters
• Main idea: share evidence with similar words
• Cluster words to reduce sparsity
presented the laconic messagepursued these pompous lesson
aired that melancholic lettercommissioned this bouncy counterfactuals
published incompletable stunner
• For instance: use in language modeling
p(cluster(message)|cluster(presented), cluster(the), class(laconic)
Philipp Koehn Machine Translation 8 April 2015
37Deep Learning
• Autoencoders
– first: learn embeddings unsupervised– then: supervised learning of task
• Neural network language models
– several implementations– some integrated in Moses
• Neural networks everywhere
– translation model– reordering model– operation sequence model
Philipp Koehn Machine Translation 8 April 2015
39Big Data
For many language pairs, lots of text available.
Text you read in your lifetime
Translated textavailable
English textavailable
300 million words
billions of words
trillions of words
Philipp Koehn Machine Translation 8 April 2015
40Mining the Web
• Largest source for test: the World Wide Web
• Common Crawl
– publicly available crawl of the web
– hosted by Amazon Web Services, but can be downloaded
– regularly updated (semi-annual)
– 2-4 billion web pages per crawl
• Currently filling up our hard drives
Philipp Koehn Machine Translation 8 April 2015
41Monolingual Data
• Starting point: 35TB of text
• Processing pipeline [Buck et al., 2014]
– language detection– reduplication– normalization of Unicode characters– sentence splitting
• Obtained corpora
Language Lines (B) Tokens (B) Bytes BLEU (WMT)English 59.13 975.63 5.14 TB -German 3.87 51.93 317.46 GB +0.5Spanish 3.50 62.21 337.16 GB -French 3.04 49.31 273.96 GB +0.6Russian 1.79 21.41 220.62 GB +1.2Czech 0.47 5.79 34.67 GB +0.6
Philipp Koehn Machine Translation 8 April 2015
42Parallel Data
• Basic processing pipeline [Smith et al., 2013]– find parallel web pages (based on URL only)– align document by HTML structure– sentence splitting and tokenization– sentence alignment– filtering (remove boilerplate)
• Obtained corporaFrench German Spanish Russian Japanese Chinese
Segments 10.2M 7.50M 5.67M 3.58M 1.70M 1.42MForeign Tokens 128M 79.9M 71.5M 34.7M 9.91M 8.14MEnglish Tokens 118M 87.5M 67.6M 36.7M 19.1M 14.8M
Bengali Farsi Telugu Somali Kannada PashtoSegments 59.9K 44.2K 50.6K 52.6K 34.5K 28.0KForeign Tokens 573K 477K 336K 318K 305K 208KEnglish Tokens 537K 459K 358K 325K 297K 218K
• Much more work needed!
Philipp Koehn Machine Translation 8 April 2015
44Post-Editing Machine Translation
(source: Autodesk)
Philipp Koehn Machine Translation 8 April 2015
45Interactivity
• Traditional professional translation approaches
– translation from scratch– post-editing translation memory match– post-editing machine translation output
• More interactive collaboration between machine and professional?
Philipp Koehn Machine Translation 8 April 2015
46Interactive Machine Translation
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
|
Philipp Koehn Machine Translation 8 April 2015
47Interactive Machine Translation
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
| He
Philipp Koehn Machine Translation 8 April 2015
48Interactive Machine Translation
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
He | has
Philipp Koehn Machine Translation 8 April 2015
49Interactive Machine Translation
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
He has | for months
Philipp Koehn Machine Translation 8 April 2015
50Interactive Machine Translation
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
He planned |
Philipp Koehn Machine Translation 8 April 2015
51Interactive Machine Translation
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
He planned | for months
Philipp Koehn Machine Translation 8 April 2015
52Word Alignment Visualization
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
He planned for months to give a lecture in New York | in
Philipp Koehn Machine Translation 8 April 2015
53Word Alignment Visualization
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator
He planned for months to give a lecture in New York | in
Philipp Koehn Machine Translation 8 April 2015
54Shading off Translated Material
Input Sentence
Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten .
Professional Translator
He planned for months to give a lecture in New York | in
Philipp Koehn Machine Translation 8 April 2015
55Choices
• Trigger the passive vocabulary
• Display multiple translations for words and phrases
er hat seit Monaten geplant , im April einen Vortrag ...
he has for months the plan in April a lecture ...it has for months now planned , in April a presentation ...
he was for several months planned to in the April a speech ...he has made since months the pipeline in April of a statement ...
he did for many months scheduled the April a general ...
• Rank and color-highlight by probability of each translation
• Prefer diversity
Philipp Koehn Machine Translation 8 April 2015
56Instant Feedback Loop
source text
MTtranslation
human translation
MTengine
post-editre-train
translate
Philipp Koehn Machine Translation 8 April 2015
57CASMACAT Home Edition
• Available as open source software
• Features
– installation on any desktop machine
– allows training of MT engines
– all new types of assistance
– incremental updating of models
• Warning: still in development stage(help welcome!)
Philipp Koehn Machine Translation 8 April 2015
59Summary
• Machine translation is not perfect, but useful
• Better models(esp. syntax)
• Better machine learning(esp. neural networks)
• More data
• Closer integration with target application(e.g., computer aided translation)
Philipp Koehn Machine Translation 8 April 2015