TRANSCRIPT
Introduction to Statistical Machine Translation
CLT310
Jörg Tiedemann, University of Helsinki
Outline for Today
• Motivation
• Overview of the course
• Brief historical background
• MT evaluation
Machine Translation
[Chinese source sentence; original text corrupted in extraction]
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Computational Linguistics
[Diagram: computational models of human languages; understanding meaning on the source-language side, speaking/generation on the target-language side; machine translation connects the two]
Why Machine Translation?
Thousands of Languages Are Spoken
MANDARIN 885,000,000; SPANISH 332,000,000; ENGLISH 322,000,000; BENGALI 189,000,000
HINDI 182,000,000; PORTUGUESE 170,000,000; RUSSIAN 170,000,000; JAPANESE 125,000,000; GERMAN 98,000,000
WU (China) 77,175,000; JAVANESE 75,500,800; KOREAN 75,000,000; FRENCH 72,000,000; VIETNAMESE 67,662,000
TELUGU 66,350,000; YUE (China) 66,000,000; MARATHI 64,783,000; TAMIL 63,075,000
TURKISH 59,000,000; URDU 58,000,000; MIN NAN (China) 49,000,000; JINYU (China) 45,000,000
GUJARATI 44,000,000; POLISH 44,000,000; ARABIC 42,500,000; UKRAINIAN 41,000,000
ITALIAN 37,000,000; XIANG (China) 36,015,000; MALAYALAM 34,022,000; HAKKA (China) 34,000,000
KANNADA 33,663,000; ORIYA 31,000,000; PANJABI 30,000,000; SUNDA 27,000,000
Source: Ethnologue
Why Machine Translation?
Sources: W3Techs.com, Internet World Stats, WorldWideWebSize.com
• > 2 billion Internet users
• > 550 million registered domains
• > 12 billion indexed web pages
[Pie charts: language shares of websites vs. Internet users]
           Websites   Internet Users
English    57%        27%
Chinese    5%         25%
Others     17%        10%
MT is a Tough Challenge (and Fun)
Translation errors may be quite severe:
• Doctor's office: "Specialist in women and other diseases"
• Pub: "Ladies are requested not to have children in the bar"
• Hotel: "Please leave your values at the front desk"
• Chinese dining hall: "Translation server error"
MT is not a solved problem ... but it constantly improves:
• Input: Vem vann Allsvenskan i fjol? ("Who won the Allsvenskan last year?")
• Google 2010: Who stole headlines last year?
• Google 2013: Who won the Championship last year?
MT and Other Language Technology
• language identification
• linguistic analysis
• speech synthesis
• translation
• synonyms
• handwritten text recognition
• term definition and disambiguation
• speech recognition
• spelling correction
Machine Translation in Real Life
MT is a Cool Research Topic
How does human language work?
• What are the differences between languages?
• How can we preserve meaning when translating?
Complex but natural task
• Natural input, natural output
• Not fixed to a particular theory or formalism
Combines various aspects of computational linguistics
• analyze text or speech
• understand/transfer meaning
• generate text or speech
What is the Problem with MT?
Unrealistic expectations (especially in the early days)
• "MT is a waste of time because you will never make a machine that can translate Shakespeare"
• MT is useless because it may translate "The spirit is willing but the flesh is weak" into the Russian equivalent of "The vodka is good, but the steak is lousy"
Unexpected (not humanlike) errors
• German input: Fussball ist langweilig. Tore gibt es selten. ("Football is boring. Goals are rare.")
• Google 2012, Swedish output: Fotboll är tråkigt. Gates är sällsynta. ("Football is boring. Gates are rare."; German Tore means both "goals" and "gates")
Why is Translation so Difficult?
Natural languages are ambiguous
• I'll get a cup of coffee
• I didn't get the joke
• I get up at 8am
• I get nervous
• Yeah, I get around
Natural languages are different
• concepts, morphology, syntax
• style, culture, etymology
Natural Redundancy and Variation in Languages
Lexical Ambiguities across Languages
How do Humans do it?
Human translators need
• to understand the source language
• to know how to speak the target language (well)
• knowledge about the topic of the text to be translated
• knowledge about culture, values, traditions and expectations of speakers in both languages
Corresponding NLP challenges
• Natural language understanding
• Language generation
• Topic detection and domain adaptation
Is it Possible at all?
[Diagram: three task profiles, trading off quality against input restrictions]
• general purpose, browsing quality: fully automatic gisting (e.g. on-line services)
• post-editing, editing quality: computer-aided translation (CAT), e.g. localization
• sublanguage, publishing quality: fully automatic high-quality MT (FAHQMT) for domain-specific tasks
Balance MT quality and input restrictions, depending on the task.
Course Overview
Lectures (Mondays and Thursdays)
• Introduction of main MT approaches
• Basics of Statistical MT and Word-Based Models
• Phrase-Based SMT and Factored Models
• Document-Level Models
• Hierarchical Models
Assignments
• Practical tasks
• Seminar (presentations by students)
• Individual projects (master students)
Examination and Grading
Taking 3 credits
• active participation and assignments
• four lab reports (25% each)
Taking 5 credits
• lab reports (75%)
• additional presentation about a selected topic (25%)
Taking 7 credits
• lab reports (50%)
• presentation (25%)
• project work, presentation and final paper (25%)
Course Information
Website: http://www.helsinki.fi/~tiedeman/teach/clt310
Literature:
• Philipp Koehn: Statistical Machine Translation
• other (on-line) material
Resources:
• CSC account (Taito shell)
• Moodle for discussions and submissions
• The Internet and your fellow students
Road Map
18.01. Introduction and MT evaluation
• Reading: Koehn ch1, ch8; JM 25.1; Hutchins
• Assignment 1: MT evaluation
21.01. Word-based SMT
• Reading: Koehn ch4, ch7; KK97
• Assignment 2: Word-based SMT
25.01. Parallel corpora and alignment
• Reading: Koehn ch2-4, JT ch3-4, KK97, KK99
• Assignment 3: Alignment
28.01. Phrase-based SMT
• Reading: Koehn ch5
01.02. Decoding and tuning
• Reading: Koehn ch6
• Assignment 4: Phrase-based SMT
04.02. Document-level SMT
• Reading: CH14: ch2, ch4, ch8
08.02. Factored models
• Reading: Koehn ch10
15.02. Hierarchical models
• Reading: Koehn ch11, DH06, DH12
18.02. SMT project and discussions
29.02. Seminar (selected topics)
03.03. Seminar (project reports)
A Brief History of Machine Translation and Classical Translation Models
The Vauquois Triangle
[Diagram: the Vauquois triangle; on the source-language side, analysis (abstraction) rises through morphology and syntax to meaning and finally an interlingua; on the target-language side, production (generation) descends back to the surface form]
MT as Decoding
"When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
[Weaver, 1947, 1949]
Direct Machine Translation
[Diagram: direct transfer at the bottom of the Vauquois triangle, straight from source language to target language, with possibly some morphological analysis on the source side and morphological generation on the target side]
Linguistically Motivated MT
Assume that human translators work like this:
• Understand the original message
• Transfer meaning to the target language context
• Produce a grammatical message in the target language
Transfer-based Machine Translation
• Source language analysis
• Transfer of an abstract representation
• Generation of target language text
Expert-Driven Rule-Based Systems
[Diagram: source-language grammar rules and target-language grammar rules, connected by hand-written transfer rules]
3.3 Synchronous Context Free Grammar (SCFG)
SMT systems often produce ungrammatical and disfluent translation hypotheses.
This is mainly because of language models that only look at local contexts (n-gram
models). Syntax-based approaches to machine translation aim at fixing the problem of
producing such disfluent outputs. Such syntax-based systems may take advantage of
Synchronous Context Free Grammars (SCFG): mappings between two context free
rules that specify how source language phrase structures correspond to target language
phrase structures. For example, the following basic rule maps noun phrases in the
source and target languages, and defines the order of the daughter constituents (Noun
and Adjective) in both languages.
NP::NP [Det ADJ N] → [Det N ADJ]
This rule maps English NPs to French NPs, stating that an English NP is constructed
from daughters Det, ADJ and N, while the French NP is constructed from daughters
Det, N and ADJ, in this order. Figure 3a demonstrates translation using transfer of such
derivation trees from English to French using this SCFG rule and lexical rules.
Figure 3a: translation of the English NP "the red house" into the French NP "la maison rouge" using SCFG.
Source rules:  NP → Det ADJ N;  Det → the;  ADJ → red;  N → house
Target rules:  NP → Det N ADJ;  Det → la;   ADJ → rouge;  N → maison
VP → PP[+Goal] V ⇒ VP → V PP[+Goal]
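The NP rule above can also be sketched in code. This is a toy illustration, not code from any real system: the lexicon, the encoding of the synchronous rule as a position permutation, and the function name are all assumptions made for the example.

```python
# Toy lexical rules (assumed; English -> French)
lexicon = {"the": "la", "red": "rouge", "house": "maison"}

# The synchronous rule NP::NP [Det ADJ N] -> [Det N ADJ] encoded as a
# permutation of source constituent positions on the target side
target_order = [0, 2, 1]   # Det stays first, then N, then ADJ

def translate_np(source_np):
    """Translate a source NP already parsed as [Det, ADJ, N]."""
    reordered = [source_np[i] for i in target_order]   # apply the SCFG rule
    return [lexicon[word] for word in reordered]       # apply lexical rules

print(translate_np(["the", "red", "house"]))  # ['la', 'maison', 'rouge']
```

The same permutation idea generalizes: every synchronous rule pairs one source-side production with one target-side production plus an alignment of their daughters.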
Why doesn’t it work well?
• Static categorical models
• But languages are dynamic, ambiguous, distributional
[Diagram: meaning as the intended bridge between source language and target language]
Expert-Driven Rule-Based Systems
ALPAC report (1966)
• MT quality is too low
• No advantage of MT over human translation
• Almost all funding was stopped in the U.S.
Problems:
• Robustness and translation speed
• Expensive development
• Not very flexible (new domains, languages ...)
Human Second Language Learning
Traditional Approach Emphasises
• training vocabulary
• teaching a dense grammar
Modern Language Teaching Emphasises
• learning by doing
• conversation
• repetition, practising
• listening to native speakers (abroad)
[Scale: from strongly supervised, expert-driven learning to un-/semi-supervised, data-driven learning]
Data-Driven Machine Translation
[Diagram: an MT model is learned from human translations; translation from source to target language is performed by decoding, without an explicit meaning representation]
Data-Driven Machine Translation
Hmm, every time he sees “banco”, he either types “bank” or “bench” but if he sees “banco de ”,he always types “bank”, never “bench”
Man, this is so boring.
Translated documents
[Diagram: translated documents feed translation modeling, target language data feeds language modeling, and both go into the learning algorithm]
Statistical Machine Translation
1947: MT as decoding (Warren Weaver)
1988: Word-based models
1999: Public implementation of alignment models (GIZA)
2003: Phrase-based SMT
2004: Public phrase-based decoder (Pharaoh)
2005: Hierarchical models
2007: Moses (end-to-end SMT toolbox)
2014: Neural machine translation
along with many tools, much more data and better computers
The Advantage of Data-Driven MT
Human Translations Naturally Appear
• no need for artificial annotation
• can be provided by non-experts
Implicit Linguistics
• translation knowledge is in the data
• distributional relations within and across languages
Constant Learning is Possible
• feed with new data as they appear
• quickly adapt to new domains and language pairs
Now Something Different:MT Evaluation
(based on slides by Philipp Koehn)
Why Evaluation?
How good is a given system?
• general evaluation and assessment
Which one is the best system for a specific purpose?
• task-specific evaluation
How much did we improve our system?
• system development
How can we tune our system to become better?
• parameter estimation
But MT evaluation is a difficult problem!
What Is The Problem?
Ten Translations of a Chinese Sentence
(a typical example from the 2001 NIST evaluation set)
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport's security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport's security is the responsibility of the Israeli security officials.
Evaluation of Machine Translation 2
Evaluation Metrics
Subjective judgements by human evaluators
• translation quality
• grammaticality and style
• inter-annotator agreement
Automatic evaluation metrics
• based on reference translations
• linguistic resources to account for natural variation
Task-based evaluation, e.g.
• estimate post-editing effort
• information preservation for cross-lingual IR
Human versus Automatic Evaluation
Human evaluation
• is ultimately what we are interested in
• is VERY time consuming (and boring)
• is not re-usable
Automatic evaluation
• is cheap and re-usable
• but is not necessarily reliable
Adequacy and Fluency
Human judgements based on
• machine translation output
• source language input and/or reference translations
Adequacy
• Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
Fluency
• Is the output good fluent English? This involves both grammatical correctness and idiomatic word choices.
Adequacy and Fluency: Scales
Adequacy              Fluency
5  all meaning        5  flawless English
4  most meaning       4  good English
3  much meaning       3  non-native English
2  little meaning     2  disfluent English
1  none               1  incomprehensible
Evaluators Disagree
Histogram of adequacy judgments by different human evaluators:
[Figure: five histograms over the 1-5 adequacy scale (y-axis 10-30%), one per evaluator, showing clearly different judgment distributions]
(from WMT 2006 evaluation)
Ranking instead of Absolute Scales
Instructions: Rank each translation from Best to Worst relative to the other choices (ties are allowed). These are not interpreted as absolute scores; they are relative scores.
Annotator: ccb   Task: WMT09 Spanish-English News Corpus
(You have judged 25 sentences for WMT09 Spanish-English News Corpus, 427 sentences total, taking 64.9 seconds per sentence.)
Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Hema-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.
Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.
Translations to rank (each on a relative scale from 1 = Best to 5 = Worst):
• These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.
• These tissues analysed, processed and before frozen of stored in Hema-Québec, which also operates the only public bank umbilical cord blood in Quebec.
• These tissues are analyzed, processed and frozen before being stored in Hema-Québec, which also manages the only public bank umbilical cord blood in Quebec.
• These tissues are analyzed, processed and frozen before being stored in Hema-Quebec, which also operates the only public bank of umbilical cord blood in Quebec.
• These fabrics are analyzed, are transformed and are frozen before being stored in Hema-Québec, who manages also the only public bank of blood of the umbilical cord in Quebec.
Automatic Evaluation Metrics
Goal:
• a program that computes the quality of translations
Advantages:
• low cost, fast, re-usable
Basic strategy
• given: machine translation output
• given: human reference translation
• task: compute similarity between them
Precision and Recall of Words
SYSTEM A:   Israeli officials responsibility of airport safety
REFERENCE:  Israeli officials are responsible for airport security
precision = correct / output-length = 3/6 = 50%
recall = correct / reference-length = 3/7 = 43%
f-measure = (precision × recall) / ((precision + recall) / 2) = (.5 × .43) / ((.5 + .43) / 2) = 46%
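The computation above can be sketched as a short function. This is an illustrative implementation under assumed conventions (whitespace tokenization, bag-of-words overlap so each reference word is credited at most once):

```python
from collections import Counter

def prf(system, reference):
    """Word precision, recall and F-measure against one reference."""
    sys_words, ref_words = system.split(), reference.split()
    # clipped bag-of-words overlap: each reference word counts at most once
    correct = sum((Counter(sys_words) & Counter(ref_words)).values())
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(sys_words)
    recall = correct / len(ref_words)
    f = precision * recall / ((precision + recall) / 2)
    return precision, recall, f

p, r, f = prf("Israeli officials responsibility of airport safety",
              "Israeli officials are responsible for airport security")
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.43 0.46
```

Note that the F-measure here is the harmonic mean of precision and recall, which is exactly what the formula on the slide computes.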
Precision and Recall
SYSTEM A:   Israeli officials responsibility of airport safety
SYSTEM B:   airport security Israeli officials are responsible
REFERENCE:  Israeli officials are responsible for airport security
Metric      System A   System B
precision   50%        100%
recall      43%        100%
f-measure   46%        100%
Flaw: no penalty for reordering!
Word Error Rate
Minimum number of editing steps to transform output to reference (Levenshtein distance):
• match: words match, no cost
• substitution: replace one word with another
• insertion: add word
• deletion: drop word
wer = (substitutions + insertions + deletions) / reference-length
Example
[Table: word-level Levenshtein alignment matrices for System A and System B against the reference "Israeli officials are responsible for airport security"]
Metric                 System A   System B
word error rate (wer)  57%        71%
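The WER numbers above can be reproduced with a standard word-level Levenshtein distance. This is an assumed implementation (uniform edit costs, whitespace tokenization), not the exact tool used in the course:

```python
def wer(system, reference):
    """Word error rate: Levenshtein distance over words / reference length."""
    s, r = system.split(), reference.split()
    # d[i][j] = edit distance between the first i system words
    # and the first j reference words
    d = [[0] * (len(r) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                      # delete all i system words
    for j in range(len(r) + 1):
        d[0][j] = j                      # insert all j reference words
    for i in range(1, len(s) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if s[i - 1] == r[j - 1] else 1   # a match is free
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # match / substitution
    return d[len(s)][len(r)] / len(r)

reference = "Israeli officials are responsible for airport security"
print(round(wer("Israeli officials responsibility of airport safety",
                reference), 2))  # 0.57 (system A)
print(round(wer("airport security Israeli officials are responsible",
                reference), 2))  # 0.71 (system B)
```

System A needs 3 substitutions and 1 insertion (4/7 = 57%); System B needs 2 deletions and 3 insertions around the matched span "Israeli officials are responsible" (5/7 = 71%).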
BLEU (Bilingual Evaluation Understudy)
• N-gram overlap between machine translation output and reference translation
• Compute the geometric mean of the n-gram precisions (typically size 1 to 4):
  P = (precision_1 * precision_2 * ... * precision_n)^(1/n)
• Add a brevity penalty for short translations (a recall component):
  BP = 1 if output-length c > reference-length r
  BP = exp(1 - r/c) if output-length c <= reference-length r
BLEU (Bilingual Evaluation Understudy)
• More efficient: P = exp( (1/n) * sum_{i=1..n} log(precision_i) )
• Putting everything together (for 1- to 4-grams):
  BLEU = min(1, exp(1 - reference-length / output-length)) * exp( (1/N) * sum_{n=1..N} log(precision_n) )
• Typically computed over the entire test corpus, not single sentences. Can you figure out why?
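The formula can be sketched for a single sentence as follows. This is an illustrative implementation with assumed whitespace tokenization; it already clips n-gram counts against the reference, and it uses the exponential brevity penalty exp(1 - r/c) ≈ 0.846, whereas the worked example table that follows approximates the penalty as 6/7 ≈ 0.857, which is why the table reports 52% rather than the ≈51% this sketch produces.

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bleu(system, reference, max_n=4):
    """Single-sentence BLEU against one reference (illustrative sketch)."""
    sys_w, ref_w = system.split(), reference.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        sys_counts = Counter(ngrams(sys_w, n))
        ref_counts = Counter(ngrams(ref_w, n))
        # clipped counts: each n-gram is credited at most as often
        # as it occurs in the reference
        correct = sum((sys_counts & ref_counts).values())
        if correct == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions += math.log(correct / len(ngrams(sys_w, n)))
    brevity = min(1.0, math.exp(1 - len(ref_w) / len(sys_w)))
    return brevity * math.exp(log_precisions / max_n)

reference = "Israeli officials are responsible for airport security"
print(round(bleu("Israeli officials responsibility of airport safety",
                 reference), 2))  # 0.0  (system A: no 3-gram matches)
print(round(bleu("airport security Israeli officials are responsible",
                 reference), 2))  # 0.51 (system B)
```

The zero-precision early return also illustrates why BLEU is pooled over a whole corpus: a single sentence with no 4-gram match would otherwise score 0 regardless of its other merits.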
Example
SYSTEM A:   Israeli officials responsibility of airport safety  (2-gram match, 1-gram match)
SYSTEM B:   airport security Israeli officials are responsible  (4-gram match, 2-gram match)
REFERENCE:  Israeli officials are responsible for airport security
Metric              System A   System B
precision (1-gram)  3/6        6/6
precision (2-gram)  1/5        4/5
precision (3-gram)  0/4        2/4
precision (4-gram)  0/3        1/3
brevity penalty     6/7        6/7
BLEU                0%         52%
Modified N-gram Precision
Avoid counting correct N-grams more often than they appear in any reference translation!
count_clip = min(count_candidate, max_count_reference)
Candidate:   the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
count_clip(the) = 2
precision_1 = 2/7 (unigram precision)
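The clipping rule can be sketched for the unigram case. In this illustrative version the tokens are assumed to be pre-lowercased with punctuation stripped, so that "The" and "the" match as in the example:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Unigram precision with counts clipped against the references."""
    cand_counts = Counter(candidate.split())
    clipped = 0
    for word, count in cand_counts.items():
        # credit each word at most as often as it appears in the most
        # generous single reference translation
        max_ref_count = max(Counter(ref.split())[word] for ref in references)
        clipped += min(count, max_ref_count)
    return clipped / sum(cand_counts.values())

# tokenized and lowercased versions of the example above
refs = ["the cat is on the mat", "there is a cat on the mat"]
print(clipped_unigram_precision("the the the the the the the", refs))  # 2/7
```

Reference 1 contains "the" twice, so the seven candidate occurrences are clipped to 2, giving precision 2/7.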
Multiple Reference Translations
To account for variability, use multiple reference translations:
• N-grams may match in any of the references
• the closest reference length is used for the brevity penalty
Example:
SYSTEM:      Israeli officials responsibility of airport safety  (2-gram, 1-gram and 2-gram matches across references)
REFERENCES:  Israeli officials are responsible for airport security
             Israel is in charge of the security at this airport
             The security work for this airport is the responsibility of the Israel government
             Israeli side was in charge of the security of this airport
Correlation with Human Judgement
[Figure: automatic metric scores plotted against human judgements]
Typical BLEU Scores
BLEU scores for 110 statistical machine translation systems (Koehn 2005):
%    da   de   el   en   es   fr   fi   it   nl   pt   sv
da    -  18.4 21.1 28.5 26.4 28.7 14.2 22.2 21.4 24.3 28.3
de  22.3   -  20.7 25.3 25.4 27.7 11.8 21.3 23.4 23.2 20.5
el  22.7 17.4   -  27.2 31.2 32.1 11.4 26.8 20.0 27.6 21.2
en  25.2 17.6 23.2   -  30.1 31.1 13.0 25.3 21.0 27.1 24.8
es  24.1 18.2 28.3 30.5   -  40.2 12.5 32.3 21.4 35.9 23.9
fr  23.7 18.5 26.1 30.0 38.4   -  12.6 32.4 21.1 35.3 22.6
fi  20.0 14.5 18.2 21.8 21.1 22.4   -  18.3 17.0 19.1 18.8
it  21.4 16.9 24.8 27.8 34.0 36.0 11.0   -  20.0 31.2 20.2
nl  20.5 18.3 17.4 23.0 22.9 24.6 10.3 20.0   -  20.7 19.0
pt  23.2 18.2 26.4 30.1 37.9 39.0 11.9 32.0 20.2   -  21.9
sv  30.3 18.9 22.8 30.2 28.6 29.7 15.3 23.9 21.9 25.9   -
METEOR: Flexible Matching
• Partial credit for matching lemmas (morphological variation)
  system:    Jim went home
  reference: Joe goes home
• Partial credit for matching (near) synonyms (lexical variation)
  system:    Jim walks home
  reference: Joe goes home
• Use of paraphrases
Metric Research
Active development of new metrics
• syntactic similarity
• semantic equivalence or entailment
• metrics targeted at sub-problems (re-ordering)
• tunable (parametric) metrics
• ...
On-going evaluation campaigns rank new proposals.
Summary
Data-driven MT vs. expert-driven MT
• re-use linguistic information hidden in data
• design models and learning algorithms instead of MT rules
• translation as a search problem (decoding)
MT evaluation
• important for checking quality
• essential for system development
• necessary for parameter tuning
Human evaluators are expensive and disagree.
Automatic evaluation is cheap but not always reliable.