TRANSCRIPT
Introduction to Statistical Machine Translation
CLT310
Jörg Tiedemann, University of Helsinki
Outline for Today
• Motivation
• Overview of the course
• Brief historical background
• MT evaluation
Machine Translation
[Chinese source sentence; original text corrupted in extraction]
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Computational Linguistics
[Diagram: computational models of human languages; understanding meaning on the source-language side, speaking/generation on the target-language side; machine translation connects the two]
Why Machine Translation?
Thousands of Languages Are Spoken
MANDARIN 885,000,000; SPANISH 332,000,000; ENGLISH 322,000,000; BENGALI 189,000,000
HINDI 182,000,000; PORTUGUESE 170,000,000; RUSSIAN 170,000,000; JAPANESE 125,000,000; GERMAN 98,000,000
WU (China) 77,175,000; JAVANESE 75,500,800; KOREAN 75,000,000; FRENCH 72,000,000; VIETNAMESE 67,662,000
TELUGU 66,350,000; YUE (China) 66,000,000; MARATHI 64,783,000; TAMIL 63,075,000
TURKISH 59,000,000; URDU 58,000,000; MIN NAN (China) 49,000,000; JINYU (China) 45,000,000
GUJARATI 44,000,000; POLISH 44,000,000; ARABIC 42,500,000; UKRAINIAN 41,000,000
ITALIAN 37,000,000; XIANG (China) 36,015,000; MALAYALAM 34,022,000; HAKKA (China) 34,000,000
KANNADA 33,663,000; ORIYA 31,000,000; PANJABI 30,000,000; SUNDA 27,000,000
Source: Ethnologue
Why Machine Translation?
Sources: W3Techs.com, Internet World Stats, WorldWideWebSize.com
• > 2 billion Internet users
• > 550 million registered domains
• > 12 billion indexed web pages
[Pie charts: language shares of websites vs. Internet users]
           Websites   Internet Users
English    57%        27%
Chinese    5%         25%
Others     17%        10%
MT is a Tough Challenge (and Fun)
Translation errors may be quite severe:
• Doctor's office: "Specialist in women and other diseases"
• Pub: "Ladies are requested not to have children in the bar"
• Hotel: "Please leave your values at the front desk"
• Chinese dining hall: "Translation server error"
MT is not a solved problem ... but it constantly improves:
• Input: Vem vann Allsvenskan i fjol? ("Who won the Allsvenskan last year?")
• Google 2010: Who stole headlines last year?
• Google 2013: Who won the Championship last year?
MT and Other Language Technology
• language identification
• linguistic analysis
• speech synthesis
• translation
• synonyms
• handwritten text recognition
• term definition and disambiguation
• speech recognition
• spelling correction
Machine Translation in Real Life
MT is a Cool Research Topic
How does human language work?
• What are the differences between languages?
• How can we preserve meaning when translating?
Complex but natural task
• Natural input, natural output
• Not fixed to a particular theory or formalism
Combines various aspects of computational linguistics
• analyze text or speech
• understand/transfer meaning
• generate text or speech
What is the Problem with MT?
Unrealistic expectations (especially in the early days)
• "MT is a waste of time because you will never make a machine that can translate Shakespeare"
• MT is useless because it may translate "The spirit is willing but the flesh is weak" into the Russian equivalent of "The vodka is good, but the steak is lousy"
Unexpected (not humanlike) errors
• German input: Fussball ist langweilig. Tore gibt es selten. ("Football is boring. Goals are rare.")
• Google 2012, Swedish output: Fotboll är tråkigt. Gates är sällsynta. ("Football is boring. Gates are rare."; German Tore means both "goals" and "gates")
Why is Translation so Difficult?
Natural languages are ambiguous
• I'll get a cup of coffee
• I didn't get the joke
• I get up at 8am
• I get nervous
• Yeah, I get around
Natural languages are different
• concepts, morphology, syntax
• style, culture, etymology
Natural Redundancy and Variation in Languages
Lexical Ambiguities across Languages
How do Humans do it?
Human translators need
• to understand the source language
• to know how to speak the target language (well)
• knowledge about the topic of the text to be translated
• knowledge about culture, values, traditions and expectations of speakers in both languages
Corresponding NLP challenges
• Natural language understanding
• Language generation
• Topic detection and domain adaptation
Is it Possible at all?
[Diagram: three task profiles, trading off quality against input restrictions]
• general purpose, browsing quality: fully automatic gisting (e.g. on-line services)
• post-editing, editing quality: computer-aided translation (CAT), e.g. localization
• sublanguage, publishing quality: fully automatic high-quality MT (FAHQMT) for domain-specific tasks
Balance MT quality and input restrictions, depending on the task.
Course Overview
Lectures (Mondays and Thursdays)
• Introduction of main MT approaches
• Basics of Statistical MT and Word-Based Models
• Phrase-Based SMT and Factored Models
• Document-Level Models
• Hierarchical Models
Assignments
• Practical tasks
• Seminar (presentations by students)
• Individual projects (master students)
Examination and Grading
Taking 3 credits
• active participation and assignments
• four lab reports (25% each)
Taking 5 credits
• lab reports (75%)
• additional presentation about a selected topic (25%)
Taking 7 credits
• lab reports (50%)
• presentation (25%)
• project work, presentation and final paper (25%)
Course Information
Website: http://www.helsinki.fi/~tiedeman/teach/clt310
Literature:
• Philipp Koehn: Statistical Machine Translation
• other (on-line) material
Resources:
• CSC account (Taito shell)
• Moodle for discussions and submissions
• The Internet and your fellow students
Road Map
18.01. Introduction and MT evaluation
• Reading: Koehn ch1, ch8; JM 25.1; Hutchins
• Assignment 1: MT evaluation
21.01. Word-based SMT
• Reading: Koehn ch4, ch7; KK97
• Assignment 2: Word-based SMT
25.01. Parallel corpora and alignment
• Reading: Koehn ch2-4, JT ch3-4, KK97, KK99
• Assignment 3: Alignment
28.01. Phrase-based SMT
• Reading: Koehn ch5
01.02. Decoding and tuning
• Reading: Koehn ch6
• Assignment 4: Phrase-based SMT
04.02. Document-level SMT
• Reading: CH14: ch2, ch4, ch8
08.02. Factored models
• Reading: Koehn ch10
15.02. Hierarchical models
• Reading: Koehn ch11, DH06, DH12
18.02. SMT project and discussions
29.02. Seminar (selected topics)
03.03. Seminar (project reports)
A Brief History of Machine Translation and Classical Translation Models
The Vauquois Triangle
[Diagram: the Vauquois triangle; on the source-language side, analysis (abstraction) rises through morphology and syntax to meaning and finally an interlingua; on the target-language side, production (generation) descends back to the surface form]
MT as Decoding
"When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
[Weaver, 1947, 1949]
Direct Machine Translation
[Diagram: direct transfer at the bottom of the Vauquois triangle, straight from source language to target language, with possibly some morphological analysis on the source side and morphological generation on the target side]
Linguistically Motivated MT
Assume that human translators work like this:
• Understand the original message
• Transfer meaning to the target language context
• Produce a grammatical message in the target language
Transfer-based Machine Translation
• Source language analysis
• Transfer of an abstract representation
• Generation of target language text
Expert-Driven Rule-Based Systems
[Diagram: source-language grammar rules and target-language grammar rules, connected by hand-written transfer rules]
3.3 Synchronous Context Free Grammar (SCFG)
SMT systems often produce ungrammatical and disfluent translation hypotheses.
This is mainly because of language models that only look at local contexts (n-gram
models). Syntax-based approaches to machine translation aim at fixing the problem of
producing such disfluent outputs. Such syntax-based systems may take advantage of
Synchronous Context Free Grammars (SCFG): mappings between two context free
rules that specify how source language phrase structures correspond to target language
phrase structures. For example, the following basic rule maps noun phrases in the
source and target languages, and defines the order of the daughter constituents (Noun
and Adjective) in both languages.
NP::NP [Det ADJ N] → [Det N ADJ]
This rule maps English NPs to French NPs, stating that an English NP is constructed
from daughters Det, ADJ and N, while the French NP is constructed from daughters
Det, N and ADJ, in this order. Figure 3a demonstrates translation using transfer of such
derivation trees from English to French using this SCFG rule and lexical rules.
Figure 3a: translation of the English NP "the red house" into the French NP "la maison rouge" using SCFG.
Source rules:  NP → Det ADJ N;  Det → the;  ADJ → red;  N → house
Target rules:  NP → Det N ADJ;  Det → la;   ADJ → rouge;  N → maison
VP → PP[+Goal] V ⇒ VP → V PP[+Goal]
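The NP rule above can also be sketched in code. This is a toy illustration, not code from any real system: the lexicon, the encoding of the synchronous rule as a position permutation, and the function name are all assumptions made for the example.

```python
# Toy lexical rules (assumed; English -> French)
lexicon = {"the": "la", "red": "rouge", "house": "maison"}

# The synchronous rule NP::NP [Det ADJ N] -> [Det N ADJ] encoded as a
# permutation of source constituent positions on the target side
target_order = [0, 2, 1]   # Det stays first, then N, then ADJ

def translate_np(source_np):
    """Translate a source NP already parsed as [Det, ADJ, N]."""
    reordered = [source_np[i] for i in target_order]   # apply the SCFG rule
    return [lexicon[word] for word in reordered]       # apply lexical rules

print(translate_np(["the", "red", "house"]))  # ['la', 'maison', 'rouge']
```

The same permutation idea generalizes: every synchronous rule pairs one source-side production with one target-side production plus an alignment of their daughters.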
Why doesn’t it work well?
• Static categorical models
• But languages are dynamic, ambiguous, distributional
[Diagram: meaning as the intended bridge between source language and target language]
Expert-Driven Rule-Based Systems
ALPAC report (1966)
• MT quality is too low
• No advantage of MT over human translation
• Almost all funding was stopped in the U.S.
Problems:
• Robustness and translation speed
• Expensive development
• Not very flexible (new domains, languages ...)
Human Second Language Learning
Traditional Approach Emphasises
• training vocabulary
• teaching a dense grammar
Modern Language Teaching Emphasises
• learning by doing
• conversation
• repetition, practising
• listening to native speakers (abroad)
[Scale: from strongly supervised, expert-driven learning to un-/semi-supervised, data-driven learning]
Data-Driven Machine Translation
[Diagram: an MT model is learned from human translations; translation from source to target language is performed by decoding, without an explicit meaning representation]
Data-Driven Machine Translation
Hmm, every time he sees “banco”, he either types “bank” or “bench” but if he sees “banco de ”,he always types “bank”, never “bench”
Man, this is so boring.
Translated documents
[Diagram: translated documents feed translation modeling, target language data feeds language modeling, and both go into the learning algorithm]
Statistical Machine Translation
1947: MT as decoding (Warren Weaver)
1988: Word-based models
1999: Public implementation of alignment models (GIZA)
2003: Phrase-based SMT
2004: Public phrase-based decoder (Pharaoh)
2005: Hierarchical models
2007: Moses (end-to-end SMT toolbox)
2014: Neural machine translation
along with many tools, much more data and better computers
The Advantage of Data-Driven MT
Human Translations Naturally Appear
• no need for artificial annotation
• can be provided by non-experts
Implicit Linguistics
• translation knowledge is in the data
• distributional relations within and across languages
Constant Learning is Possible
• feed with new data as they appear
• quickly adapt to new domains and language pairs
Now Something Different:MT Evaluation
(based on slides by Philipp Koehn)
Why Evaluation?
How good is a given system?
• general evaluation and assessment
Which one is the best system for a specific purpose?
• task-specific evaluation
How much did we improve our system?
• system development
How can we tune our system to become better?
• parameter estimation
But MT evaluation is a difficult problem!
What Is The Problem?
Ten Translations of a Chinese Sentence
(a typical example from the 2001 NIST evaluation set)
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport's security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport's security is the responsibility of the Israeli security officials.
Evaluation of Machine Translation 2
Evaluation Metrics
Subjective judgements by human evaluators
• translation quality
• grammaticality and style
• inter-annotator agreement
Automatic evaluation metrics
• based on reference translations
• linguistic resources to account for natural variation
Task-based evaluation, e.g.
• estimate post-editing effort
• information preservation for cross-lingual IR
Human versus Automatic Evaluation
Human evaluation
• is ultimately what we are interested in
• is VERY time consuming (and boring)
• is not re-usable
Automatic evaluation
• is cheap and re-usable
• but is not necessarily reliable
Adequacy and Fluency
Human judgements based on
• machine translation output
• source language input and/or reference translations
Adequacy
• Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
Fluency
• Is the output good fluent English? This involves both grammatical correctness and idiomatic word choices.
Adequacy and Fluency: Scales
Adequacy              Fluency
5  all meaning        5  flawless English
4  most meaning       4  good English
3  much meaning       3  non-native English
2  little meaning     2  disfluent English
1  none               1  incomprehensible
Evaluators Disagree
Histogram of adequacy judgments by different human evaluators:
[Figure: five histograms over the 1-5 adequacy scale (y-axis 10-30%), one per evaluator, showing clearly different judgment distributions]
(from WMT 2006 evaluation)
Ranking instead of Absolute Scales
Instructions: Rank each translation from Best to Worst relative to the other choices (ties are allowed). These are not interpreted as absolute scores; they are relative scores.
Annotator: ccb   Task: WMT09 Spanish-English News Corpus
(You have judged 25 sentences for WMT09 Spanish-English News Corpus, 427 sentences total, taking 64.9 seconds per sentence.)
Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Hema-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.
Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.
Translations to rank (each on a relative scale from 1 = Best to 5 = Worst):
• These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.
• These tissues analysed, processed and before frozen of stored in Hema-Québec, which also operates the only public bank umbilical cord blood in Quebec.
• These tissues are analyzed, processed and frozen before being stored in Hema-Québec, which also manages the only public bank umbilical cord blood in Quebec.
• These tissues are analyzed, processed and frozen before being stored in Hema-Quebec, which also operates the only public bank of umbilical cord blood in Quebec.
• These fabrics are analyzed, are transformed and are frozen before being stored in Hema-Québec, who manages also the only public bank of blood of the umbilical cord in Quebec.
Automatic Evaluation Metrics
Goal:
• a program that computes the quality of translations
Advantages:
• low cost, fast, re-usable
Basic strategy
• given: machine translation output
• given: human reference translation
• task: compute similarity between them
Precision and Recall of Words
SYSTEM A:   Israeli officials responsibility of airport safety
REFERENCE:  Israeli officials are responsible for airport security
precision = correct / output-length = 3/6 = 50%
recall = correct / reference-length = 3/7 = 43%
f-measure = (precision × recall) / ((precision + recall) / 2) = (.5 × .43) / ((.5 + .43) / 2) = 46%
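The computation above can be sketched as a short function. This is an illustrative implementation under assumed conventions (whitespace tokenization, bag-of-words overlap so each reference word is credited at most once):

```python
from collections import Counter

def prf(system, reference):
    """Word precision, recall and F-measure against one reference."""
    sys_words, ref_words = system.split(), reference.split()
    # clipped bag-of-words overlap: each reference word counts at most once
    correct = sum((Counter(sys_words) & Counter(ref_words)).values())
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(sys_words)
    recall = correct / len(ref_words)
    f = precision * recall / ((precision + recall) / 2)
    return precision, recall, f

p, r, f = prf("Israeli officials responsibility of airport safety",
              "Israeli officials are responsible for airport security")
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.43 0.46
```

Note that the F-measure here is the harmonic mean of precision and recall, which is exactly what the formula on the slide computes.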
Precision and Recall
SYSTEM A:   Israeli officials responsibility of airport safety
SYSTEM B:   airport security Israeli officials are responsible
REFERENCE:  Israeli officials are responsible for airport security
Metric      System A   System B
precision   50%        100%
recall      43%        100%
f-measure   46%        100%
Flaw: no penalty for reordering!
Word Error Rate
Minimum number of editing steps to transform output to reference (Levenshtein distance):
• match: words match, no cost
• substitution: replace one word with another
• insertion: add word
• deletion: drop word
wer = (substitutions + insertions + deletions) / reference-length
Example
[Table: word-level Levenshtein alignment matrices for System A and System B against the reference "Israeli officials are responsible for airport security"]
Metric                 System A   System B
word error rate (wer)  57%        71%
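The WER numbers above can be reproduced with a standard word-level Levenshtein distance. This is an assumed implementation (uniform edit costs, whitespace tokenization), not the exact tool used in the course:

```python
def wer(system, reference):
    """Word error rate: Levenshtein distance over words / reference length."""
    s, r = system.split(), reference.split()
    # d[i][j] = edit distance between the first i system words
    # and the first j reference words
    d = [[0] * (len(r) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                      # delete all i system words
    for j in range(len(r) + 1):
        d[0][j] = j                      # insert all j reference words
    for i in range(1, len(s) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if s[i - 1] == r[j - 1] else 1   # a match is free
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # match / substitution
    return d[len(s)][len(r)] / len(r)

reference = "Israeli officials are responsible for airport security"
print(round(wer("Israeli officials responsibility of airport safety",
                reference), 2))  # 0.57 (system A)
print(round(wer("airport security Israeli officials are responsible",
                reference), 2))  # 0.71 (system B)
```

System A needs 3 substitutions and 1 insertion (4/7 = 57%); System B needs 2 deletions and 3 insertions around the matched span "Israeli officials are responsible" (5/7 = 71%).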
BLEU (Bilingual Evaluation Understudy)
• N-gram overlap between machine translation output and reference translation
• Compute the geometric mean of the n-gram precisions (typically size 1 to 4):
  P = (precision_1 * precision_2 * ... * precision_n)^(1/n)
• Add a brevity penalty for short translations (a recall component):
  BP = 1 if output-length c > reference-length r
  BP = exp(1 - r/c) if output-length c <= reference-length r
BLEU (Bilingual Evaluation Understudy)
• More efficient: P = exp( (1/n) * sum_{i=1..n} log(precision_i) )
• Putting everything together (for 1- to 4-grams):
  BLEU = min(1, exp(1 - reference-length / output-length)) * exp( (1/N) * sum_{n=1..N} log(precision_n) )
• Typically computed over the entire test corpus, not single sentences. Can you figure out why?
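The formula can be sketched for a single sentence as follows. This is an illustrative implementation with assumed whitespace tokenization; it already clips n-gram counts against the reference, and it uses the exponential brevity penalty exp(1 - r/c) ≈ 0.846, whereas the worked example table that follows approximates the penalty as 6/7 ≈ 0.857, which is why the table reports 52% rather than the ≈51% this sketch produces.

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bleu(system, reference, max_n=4):
    """Single-sentence BLEU against one reference (illustrative sketch)."""
    sys_w, ref_w = system.split(), reference.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        sys_counts = Counter(ngrams(sys_w, n))
        ref_counts = Counter(ngrams(ref_w, n))
        # clipped counts: each n-gram is credited at most as often
        # as it occurs in the reference
        correct = sum((sys_counts & ref_counts).values())
        if correct == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions += math.log(correct / len(ngrams(sys_w, n)))
    brevity = min(1.0, math.exp(1 - len(ref_w) / len(sys_w)))
    return brevity * math.exp(log_precisions / max_n)

reference = "Israeli officials are responsible for airport security"
print(round(bleu("Israeli officials responsibility of airport safety",
                 reference), 2))  # 0.0  (system A: no 3-gram matches)
print(round(bleu("airport security Israeli officials are responsible",
                 reference), 2))  # 0.51 (system B)
```

The zero-precision early return also illustrates why BLEU is pooled over a whole corpus: a single sentence with no 4-gram match would otherwise score 0 regardless of its other merits.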
Example
SYSTEM A:   Israeli officials responsibility of airport safety  (2-gram match, 1-gram match)
SYSTEM B:   airport security Israeli officials are responsible  (4-gram match, 2-gram match)
REFERENCE:  Israeli officials are responsible for airport security
Metric              System A   System B
precision (1-gram)  3/6        6/6
precision (2-gram)  1/5        4/5
precision (3-gram)  0/4        2/4
precision (4-gram)  0/3        1/3
brevity penalty     6/7        6/7
BLEU                0%         52%
Modified N-gram Precision
Avoid counting correct N-grams more often than they appear in any reference translation!
count_clip = min(count_candidate, max_count_reference)
Candidate:   the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
count_clip(the) = 2
precision_1 = 2/7 (unigram precision)
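The clipping rule can be sketched for the unigram case. In this illustrative version the tokens are assumed to be pre-lowercased with punctuation stripped, so that "The" and "the" match as in the example:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Unigram precision with counts clipped against the references."""
    cand_counts = Counter(candidate.split())
    clipped = 0
    for word, count in cand_counts.items():
        # credit each word at most as often as it appears in the most
        # generous single reference translation
        max_ref_count = max(Counter(ref.split())[word] for ref in references)
        clipped += min(count, max_ref_count)
    return clipped / sum(cand_counts.values())

# tokenized and lowercased versions of the example above
refs = ["the cat is on the mat", "there is a cat on the mat"]
print(clipped_unigram_precision("the the the the the the the", refs))  # 2/7
```

Reference 1 contains "the" twice, so the seven candidate occurrences are clipped to 2, giving precision 2/7.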
Multiple Reference Translations
To account for variability, use multiple reference translations:
• N-grams may match in any of the references
• the closest reference length is used for the brevity penalty
Example:
SYSTEM:      Israeli officials responsibility of airport safety  (2-gram, 1-gram and 2-gram matches across references)
REFERENCES:  Israeli officials are responsible for airport security
             Israel is in charge of the security at this airport
             The security work for this airport is the responsibility of the Israel government
             Israeli side was in charge of the security of this airport
Correlation with Human Judgement
[Figure: automatic metric scores plotted against human judgements]
Typical BLEU Scores
BLEU scores for 110 statistical machine translation systems (Koehn 2005):
%    da   de   el   en   es   fr   fi   it   nl   pt   sv
da    -  18.4 21.1 28.5 26.4 28.7 14.2 22.2 21.4 24.3 28.3
de  22.3   -  20.7 25.3 25.4 27.7 11.8 21.3 23.4 23.2 20.5
el  22.7 17.4   -  27.2 31.2 32.1 11.4 26.8 20.0 27.6 21.2
en  25.2 17.6 23.2   -  30.1 31.1 13.0 25.3 21.0 27.1 24.8
es  24.1 18.2 28.3 30.5   -  40.2 12.5 32.3 21.4 35.9 23.9
fr  23.7 18.5 26.1 30.0 38.4   -  12.6 32.4 21.1 35.3 22.6
fi  20.0 14.5 18.2 21.8 21.1 22.4   -  18.3 17.0 19.1 18.8
it  21.4 16.9 24.8 27.8 34.0 36.0 11.0   -  20.0 31.2 20.2
nl  20.5 18.3 17.4 23.0 22.9 24.6 10.3 20.0   -  20.7 19.0
pt  23.2 18.2 26.4 30.1 37.9 39.0 11.9 32.0 20.2   -  21.9
sv  30.3 18.9 22.8 30.2 28.6 29.7 15.3 23.9 21.9 25.9   -
METEOR: Flexible Matching
• Partial credit for matching lemmas (morphological variation)
  system:    Jim went home
  reference: Joe goes home
• Partial credit for matching (near) synonyms (lexical variation)
  system:    Jim walks home
  reference: Joe goes home
• Use of paraphrases
Metric Research
Active development of new metrics
• syntactic similarity
• semantic equivalence or entailment
• metrics targeted at sub-problems (re-ordering)
• tunable (parametric) metrics
• ...
On-going evaluation campaigns rank new proposals.
Summary
Data-driven MT vs. expert-driven MT
• re-use linguistic information hidden in data
• design models and learning algorithms instead of MT rules
• translation as a search problem (decoding)
MT evaluation
• important for checking quality
• essential for system development
• necessary for parameter tuning
Human evaluators are expensive and disagree.
Automatic evaluation is cheap but not always reliable.