Statistical Machine Translation: IBM Models and the Alignment Template System

Upload: toby

Post on 05-Jan-2016


Page 1: Statistical Machine Translation: IBM Models and the Alignment Template System

Statistical Machine Translation: IBM Models and the Alignment Template System

Page 2: Statistical Machine Translation: IBM Models and the Alignment Template System

Statistical Machine Translation

• Goal:
  • Given foreign sentence f:
    • “Maria no dio una bofetada a la bruja verde”
  • Find the most likely English translation e:
    • “Maria did not slap the green witch”

Page 3: Statistical Machine Translation: IBM Models and the Alignment Template System

Statistical Machine Translation

• Most likely English translation e is given by:

$$\hat{e} = \arg\max_{e} P(e \mid f)$$

• P(e|f) estimates conditional probability of any e given f

Page 4: Statistical Machine Translation: IBM Models and the Alignment Template System

Statistical Machine Translation

• How to estimate P(e|f)?
• Noisy channel:
  • Decompose P(e|f) into P(f|e) * P(e) / P(f)
  • Estimate P(f|e) and P(e) separately using parallel corpus
• Direct:
  • Estimate P(e|f) directly using parallel corpus (more on this later)

Page 5: Statistical Machine Translation: IBM Models and the Alignment Template System

Noisy Channel Model

• Translation Model
  • P(f|e)
  • How likely is f to be a translation of e?
  • Estimate parameters from bilingual corpus

• Language Model
  • P(e)
  • How likely is e to be an English sentence?
  • Estimate parameters from monolingual corpus

• Decoder
  • Given f, what is the best translation e?

$$\hat{e} = \arg\max_{e} P(e \mid f)$$

Page 6: Statistical Machine Translation: IBM Models and the Alignment Template System

Noisy Channel Model

• Generative story:
  • Generate e with probability p(e)
  • Pass e through noisy channel
  • Out comes f with probability p(f|e)

• Translation task:
  • Given f, deduce most likely e that produced f, or:

$$\hat{e} = \arg\max_{e} P(e \mid f)$$
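The noisy-channel decision rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the candidate sentences and all probability values are invented, and `translation_model`, `language_model`, and `decode` are hypothetical names. Since P(f) is the same for every candidate e, maximizing P(e|f) reduces to maximizing P(f|e) * P(e).

```python
import math

# Hypothetical scores for one fixed foreign sentence f (values invented).
# translation_model[e] stands in for P(f | e).
translation_model = {
    "Maria did not slap the green witch": 0.02,
    "Maria not slap the witch green": 0.03,
    "Maria did not slap the witch": 0.005,
}
# language_model[e] stands in for P(e).
language_model = {
    "Maria did not slap the green witch": 0.001,
    "Maria not slap the witch green": 0.00001,
    "Maria did not slap the witch": 0.002,
}

def decode(candidates):
    """Return argmax_e P(f|e) * P(e), computed in log space."""
    return max(
        candidates,
        key=lambda e: math.log(translation_model[e]) + math.log(language_model[e]),
    )

best = decode(list(translation_model))
print(best)
```

Note that the second candidate has the highest translation-model score but loses because the language model considers it unlikely English.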

Page 7: Statistical Machine Translation: IBM Models and the Alignment Template System

Translation Model

• How to model P(f|e)?

• Learn parameters of P(f|e) from a bilingual corpus S of sentence pairs <ei,fi> :

< e1,f1 > = <the blue witch, la bruja azul>

< e2,f2 > = <green, verde>

< eS,fS > = <the witch, la bruja>

Page 8: Statistical Machine Translation: IBM Models and the Alignment Template System

Translation Model

• Insufficient data in parallel corpus to estimate P(f|e) at the sentence level (Why?)

• Decompose process of translating e -> f into small steps whose probabilities can be estimated

Page 9: Statistical Machine Translation: IBM Models and the Alignment Template System

Translation Model

• English sentence e = e1…el

• Foreign sentence f = f1…fm

• Alignment A = {a1…am}, where aj ∈ {0…l} (aj = 0 means fj is not aligned to any English word)

• A indicates which English word generates each foreign word

Page 10: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignments

e: “the blue witch”

f: “la bruja azul”

A = {1,3,2} (intuitively “good” alignment)

Page 11: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignments

e: “the blue witch”

f: “la bruja azul”

A = {1,1,1} (intuitively “bad” alignment)

Page 12: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignments

e: “the blue witch”

f: “la bruja azul”

(illegal alignment!)

Page 13: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignments

• Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?

Page 14: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignments

• Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?

• Answer:
  • Each foreign word can align with any one of |e| = l words, or it can remain unaligned
  • Each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words
  • So, there are (l+1)^m alignments for a given e and f
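The count can be checked by brute force for small sentences. The sketch below is an illustration only (`enumerate_alignments` is a hypothetical helper name); it lists every alignment vector for l = 3 and m = 3, using 0 for an unaligned foreign word.

```python
from itertools import product

def enumerate_alignments(l, m):
    """All alignment vectors (a_1..a_m) with each a_j in {0..l}; 0 = unaligned."""
    return list(product(range(l + 1), repeat=m))

# e = "the blue witch" (l = 3), f = "la bruja azul" (m = 3)
alignments = enumerate_alignments(3, 3)
print(len(alignments), (3 + 1) ** 3)
```

Both the "good" alignment (1, 3, 2) and the "bad" alignment (1, 1, 1) from the earlier slides appear in this list.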

Page 15: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignments

• Question: If all alignments are equally likely, what is the probability of any one alignment, given e?

Page 16: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignments

• Question: If all alignments are equally likely, what is the probability of any one alignment, given e?

• Answer:
  • P(A|e) = p(|f| = m) * 1/(l+1)^m
  • If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C
  • P(A|e) = C /(l+1)^m

Page 17: Statistical Machine Translation: IBM Models and the Alignment Template System

Generative Story

e: “blue witch”

f: “bruja azul”

? How do we get from e to f?

Page 18: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1

• Model parameters:
  • T(fj | eaj ) = translation probability of foreign word given English word that generated it

Page 19: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1

• Generative story:
  • Given e:
  • Pick m = |f|, where all lengths m are equally probable
  • Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m
  • Pick f1…fm with probability

$$P(f \mid A, e) = \prod_{j=1}^{m} T(f_j \mid e_{a_j})$$

where T(fj | eaj ) is the translation probability of fj given the English word it is aligned to
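As a rough sketch of the last step of the generative story, the product over foreign words can be computed directly. The translation table `T` below holds invented values, and `likelihood_given_alignment` is a hypothetical helper name; alignment positions are 1-based, with 0 reserved for the NULL English word.

```python
from math import prod

# Hypothetical translation table T[(foreign_word, english_word)]; values invented.
T = {
    ("bruja", "witch"): 0.8,
    ("azul", "blue"): 0.7,
    ("bruja", "blue"): 0.1,
    ("azul", "witch"): 0.1,
}

def likelihood_given_alignment(f, e, A, T):
    """P(f | A, e) = product over j of T(f_j | e_{a_j})."""
    e_with_null = ["NULL"] + e  # position 0 is the NULL word
    return prod(T.get((fj, e_with_null[aj]), 0.0) for fj, aj in zip(f, A))

# e = "blue witch", f = "bruja azul", A = {2, 1}
p = likelihood_given_alignment(["bruja", "azul"], ["blue", "witch"], (2, 1), T)
print(p)
```

With these made-up values the result is t(bruja|witch) * t(azul|blue), which is the same pair of factors picked in the example that follows.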

Page 20: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1 Example

e: “blue witch”

Page 21: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1 Example

e: “blue witch”

f: “f1 f2”

Pick m = |f| = 2

Page 22: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1 Example

e: “blue witch”

f: “f1 f2”

Pick A = {2,1} with probability 1/(l+1)^m

Page 23: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1 Example

e: “blue witch”

f: “bruja f2”

Pick f1 = “bruja” with probability t(bruja|witch)

Page 24: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1 Example

e: “blue witch”

f: “bruja azul”

Pick f2 = “azul” with probability t(azul|blue)

Page 25: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1: Parameter Estimation

• How does this generative story help us to estimate P(f|e) from the data?

• Since the model for P(f|e) contains the parameter T(fj | eaj ), we first need to estimate T(fj | eaj )

Page 26: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1: Parameter Estimation

• How to estimate T(fj | eaj ) from the data?

• If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(fj | eaj ) using expected counts as follows:

$$T(f_j \mid e_{a_j}) = \frac{\mathrm{Count}(f_j, e_{a_j})}{\sum_{f_{j'}} \mathrm{Count}(f_{j'}, e_{a_j})}$$

Page 27: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1: Parameter Estimation

• How to estimate P(A|f,e)?
  • P(A|f,e) = P(A,f|e) / P(f|e)
  • But

$$P(f \mid e) = \sum_{A} P(A, f \mid e)$$

  • So we need to compute P(A,f|e)…
  • This is given by the Model 1 generative story:

$$P(A, f \mid e) = \frac{C}{(l+1)^m} \prod_{j=1}^{m} T(f_j \mid e_{a_j})$$

Page 28: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1 Example

e: “the blue witch”

f: “la bruja azul”

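$$P(A \mid f, e) = \frac{P(f, A \mid e)}{P(f \mid e)} = \frac{\frac{C}{4^3} \, t(\text{la} \mid \text{the}) \, t(\text{bruja} \mid \text{witch}) \, t(\text{azul} \mid \text{blue})}{\sum_{A'} \frac{C}{4^3} \prod_{j} T(f_j \mid e_{a'_j})}$$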
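For a sentence pair this small, the posterior over alignments can be computed by enumerating all (l+1)^m alignments. The sketch below is illustrative only: the table `T` contains invented values, unseen word pairs get a small floor probability (an assumption made here so that every alignment has a nonzero score), and `alignment_posterior` is a hypothetical name. The constant C/(l+1)^m is identical for every alignment, so it cancels in the ratio.

```python
from itertools import product
from math import prod

# Hypothetical translation probabilities (invented values).
T = {("la", "the"): 0.7, ("bruja", "witch"): 0.6, ("azul", "blue"): 0.8}
FLOOR = 1e-6  # assumed probability for any pair missing from T

def alignment_posterior(f, e, T):
    """Return {A: P(A | f, e)} under Model 1 by brute-force enumeration."""
    e_with_null = ["NULL"] + e
    scores = {}
    for A in product(range(len(e_with_null)), repeat=len(f)):
        scores[A] = prod(T.get((fj, e_with_null[aj]), FLOOR) for fj, aj in zip(f, A))
    total = sum(scores.values())  # proportional to P(f | e)
    return {A: s / total for A, s in scores.items()}

posterior = alignment_posterior(
    ["la", "bruja", "azul"], ["the", "blue", "witch"], T
)
best = max(posterior, key=posterior.get)
print(best, posterior[best])
```

Enumeration is exponential in m; it is used here only to mirror the formula term by term.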

Page 29: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1: Parameter Estimation

• So, in order to estimate P(f|e), we first need to estimate the model parameter T(fj | eaj )

• In order to compute T(fj | eaj ) , we need to estimate P(A|f,e)

• And in order to compute P(A|f,e), we need to estimate T(fj | eaj )…

Page 30: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 1: Parameter Estimation

• Training data is a set of pairs < ei, fi>

• Log likelihood of training data given model parameters is:

$$\sum_{i} \log P(f_i \mid e_i) = \sum_{i} \log \sum_{A} P(A \mid e_i) \, P(f_i \mid A, e_i)$$

• To maximize log likelihood of training data given model parameters, use EM:
  • hidden variable = alignments A
  • model parameters = translation probabilities T

Page 31: Statistical Machine Translation: IBM Models and the Alignment Template System

EM

• Initialize model parameters T(f|e)
• Calculate alignment probabilities P(A|f,e) under current values of T(f|e)
• Calculate expected counts from alignment probabilities
• Re-estimate T(f|e) from these expected counts
• Repeat until log likelihood of training data converges to a maximum
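The loop above can be sketched for Model 1 as follows. This is a minimal illustration under stated assumptions, not the original authors' code: it uses the three sentence pairs from the earlier Translation Model slide as a toy corpus, runs a fixed number of iterations instead of testing convergence, and relies on the fact that under Model 1 the alignment posterior factorizes per foreign word, so expected counts can be collected without enumerating alignments. `train_model1` is a hypothetical name.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (english_words, foreign_words). Returns T[(f, e)]."""
    f_vocab = {f for _, fs in corpus for f in fs}
    T = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected Count(f, e)
        total = defaultdict(float)  # expected Count(e)
        for es, fs in corpus:
            es = ["NULL"] + es
            for f in fs:
                # E-step: posterior that f was generated by each English word
                z = sum(T[(f, e)] for e in es)
                for e in es:
                    p = T[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: re-estimate T(f|e) from the expected counts
        for (f, e), c in count.items():
            T[(f, e)] = c / total[e]
    return T

corpus = [
    (["the", "blue", "witch"], ["la", "bruja", "azul"]),
    (["green"], ["verde"]),
    (["the", "witch"], ["la", "bruja"]),
]
T = train_model1(corpus)
print(round(T[("azul", "blue")], 3), round(T[("la", "blue")], 3))
```

In this toy corpus "the" and "witch" always co-occur, as do "la" and "bruja", so EM cannot tell them apart; it does learn to prefer "azul" over "la" as the translation of "blue".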

Page 32: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 2

• Model parameters:
  • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it
  • d(i|j,l,m) = distortion probability, or probability that fj is aligned to ei , given l and m

Page 33: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3

• Model parameters:
  • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it
  • r(j|i,l,m) = reverse distortion probability, or probability of position fj, given its alignment to ei, l, and m
  • n(ei) = fertility of word ei , or number of foreign words aligned to ei
  • p1 = probability of generating a foreign word by alignment with the NULL English word

Page 34: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3

• Generative Story:
  • Choose fertilities for each English word
  • Insert spurious words according to probability of being aligned to the NULL English word
  • Translate English words -> foreign words
  • Reorder words according to reverse distortion probabilities

Page 35: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3 Example

• Consider the following example from [Knight 1999]:

• Maria did not slap the green witch

Page 36: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3 Example

• Maria did not slap the green witch

• Maria not slap slap slap the green witch

• Choose fertilities: phi(Maria) = 1

Page 37: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3 Example

• Maria did not slap the green witch

• Maria not slap slap slap the green witch

• Maria not slap slap slap NULL the green witch

• Insert spurious words: p(NULL)

Page 38: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3 Example

• Maria did not slap the green witch

• Maria not slap slap slap the green witch

• Maria not slap slap slap NULL the green witch

• Maria no dio una bofetada a la verde bruja

• Translate words: t(verde|green)

Page 39: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3 Example

• Maria no dio una bofetada a la verde bruja

• Maria no dio una bofetada a la bruja verde

• Reorder words

Page 40: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 3

• For models 1 and 2:
  • We can compute exact EM updates
• For models 3 and 4:
  • Exact EM updates cannot be efficiently computed
  • Use best alignments from previous iterations to initialize each successive model
  • Explore only the subspace of potential alignments that lies within same neighborhood as the initial alignments

Page 41: Statistical Machine Translation: IBM Models and the Alignment Template System

IBM Model 4

• Model parameters:
  • Same as model 3, except uses more complicated model of reordering (for details, see Brown et al. 1993)

Page 42: Statistical Machine Translation: IBM Models and the Alignment Template System

Language Model

• Given an English sentence e1, e2 … el:

P(e1, e2 … el) = P(e1) * P(e2|e1) * … * P(el | e1, e2 … el-1)

• N-gram model:
  • Assume P(ei) depends only on the N-1 previous words, so that P(ei | e1, e2, … ei-1) = P(ei | ei-N+1, … ei-1)

Page 43: Statistical Machine Translation: IBM Models and the Alignment Template System

N=2: Bigram Language Model

P(Maria did not slap the green witch) =

P(Maria|START) *

P(did|Maria) *

P(not|did) *

P(slap|not) *

P(the|slap) *

P(green|the) *

P(witch|green) *

P(END|witch)
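A bigram model of this kind can be estimated by counting. The sketch below assumes a two-sentence training corpus invented for illustration and uses unsmoothed maximum-likelihood estimates, so any unseen bigram gives the whole sentence probability zero; real systems apply smoothing. `train_bigram` and `sentence_prob` are hypothetical names.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood estimates P(w | prev) from tokenized sentences."""
    bigrams, contexts = Counter(), Counter()
    for words in sentences:
        padded = ["START"] + words + ["END"]
        for prev, w in zip(padded, padded[1:]):
            bigrams[(prev, w)] += 1
            contexts[prev] += 1
    return {bg: c / contexts[bg[0]] for bg, c in bigrams.items()}

def sentence_prob(words, P):
    """Product of bigram probabilities, including START and END markers."""
    padded = ["START"] + words + ["END"]
    p = 1.0
    for prev, w in zip(padded, padded[1:]):
        p *= P.get((prev, w), 0.0)  # unseen bigram -> 0 (no smoothing)
    return p

# Toy monolingual corpus, invented for illustration.
train = [
    "Maria did not slap the green witch".split(),
    "Maria did not go".split(),
]
P = train_bigram(train)
p_sentence = sentence_prob("Maria did not slap the green witch".split(), P)
print(p_sentence)
```

Here every factor is 1 except P(slap|not), because "not" is followed by "slap" in only one of the two training sentences.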

Page 44: Statistical Machine Translation: IBM Models and the Alignment Template System

Word-Based MT

• Word = fundamental unit of translation

• Weaknesses:
  • no explicit modeling of word context
  • word-by-word translation may not accurately convey meaning of phrase:
    • “il ne va pas” -> “he does not go”
  • IBM models prevent alignment of foreign words with >1 English word:
    • “aller” -> “to go”

Page 45: Statistical Machine Translation: IBM Models and the Alignment Template System

Phrase-Based MT

• Phrase = basic unit of translation

• Strengths:
  • explicit modeling of word context
  • captures local reorderings, local dependencies

Page 46: Statistical Machine Translation: IBM Models and the Alignment Template System

Example Rules:

• English: he does not go

• Foreign: il ne va pas

• ne va pas -> does not go

Page 47: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignment Template System

• [Och and Ney, 2004]
• Alignment template:
  • Pair of source and target language phrases
  • Word alignment among words within those phrases
• Formally, an alignment template is a triple (F,E,A):
  • F = words on foreign side
  • E = words on English side
  • A = alignments among words on the foreign and English sides

Page 48: Statistical Machine Translation: IBM Models and the Alignment Template System

Estimating P(e|f)

• Noisy channel:
  • Decompose P(e|f) into P(f|e) and P(e)
  • Estimate P(f|e) and P(e) separately
• Direct:
  • Estimate P(e|f) directly from training corpus
  • Use log-linear model

Page 49: Statistical Machine Translation: IBM Models and the Alignment Template System

[Koehn 2003]

Log-linear Models for MT

• Compute best translation as follows:

$$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \exp\Big(\sum_{i} \lambda_i h_i(e, f)\Big)$$

• where hi are the feature functions and λi are the model parameters

• Typical feature functions include:
  • phrase translation probabilities
  • lexical translation probabilities
  • language model probability
  • reordering model
  • word penalty

Page 50: Statistical Machine Translation: IBM Models and the Alignment Template System

[Och and Ney 2003]

Log-linear Models for MT

• Noisy Channel model is a special case of Log-Linear model where:
  • h1 = log(P(f|e)), λ1 = 1
  • h2 = log(P(e)), λ2 = 1

• Then:

$$\arg\max_{e} P(e \mid f) = \arg\max_{e} \exp\big(1 \cdot \log P(f \mid e) + 1 \cdot \log P(e)\big) = \arg\max_{e} P(f \mid e) \, P(e)$$
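The log-linear score and its noisy-channel special case can be sketched as below. The two candidate translations, the feature names `tm` and `lm`, and all probability values are invented for illustration; `loglinear_score` is a hypothetical name.

```python
import math

def loglinear_score(features, weights):
    """Weighted feature sum: sum over i of lambda_i * h_i(e, f)."""
    return sum(weights[name] * h for name, h in features.items())

# Hypothetical candidates for one fixed f, with invented probabilities.
candidates = {
    "he does not go": {"tm": math.log(0.30), "lm": math.log(0.020)},
    "he not goes": {"tm": math.log(0.40), "lm": math.log(0.001)},
}

# Noisy channel as a special case: h1 = log P(f|e), h2 = log P(e), both weights 1.
noisy_channel_weights = {"tm": 1.0, "lm": 1.0}

best = max(
    candidates,
    key=lambda e: loglinear_score(candidates[e], noisy_channel_weights),
)
# Exponentiating the score recovers P(f|e) * P(e).
recovered = math.exp(loglinear_score(candidates[best], noisy_channel_weights))
print(best, recovered)
```

Changing the weights, for example giving `lm` more influence, is what distinguishes the general log-linear model from the noisy-channel case.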

Page 51: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignment Template System

• Word-align training corpus

• Extract phrase pairs

• Assign probabilities to phrase pairs

• Train language model

• Decode

Page 52: Statistical Machine Translation: IBM Models and the Alignment Template System

Word-Align Training Corpus:

• Run GIZA++ word alignment in normal direction, from e -> f

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]

Page 53: Statistical Machine Translation: IBM Models and the Alignment Template System

Word-Align Training Corpus:

• Run GIZA++ word alignment in inverse direction, from f->e

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]

Page 54: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignment Symmetrization:

• Merge bi-directional alignments using some heuristic between intersection and union

• Question: what is tradeoff in precision/recall using intersection/union?

• Here, we use union

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]
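The two extremes of symmetrization are plain set operations on alignment links. In the sketch below, the two directional alignments are invented for illustration (the original alignment figures are not available here), with links written as (English index, foreign index) pairs. Intersection keeps only links both directions agree on, which generally favors precision; union keeps every link proposed by either direction, which generally favors recall.

```python
e_words = ["he", "does", "not", "go"]
f_words = ["il", "ne", "va", "pas"]

# Hypothetical directional alignments as (english_index, foreign_index) links.
e_to_f = {(0, 0), (1, 1), (2, 3), (3, 2)}
f_to_e = {(0, 0), (2, 1), (1, 3), (3, 2)}

intersection = e_to_f & f_to_e
union = e_to_f | f_to_e

for name, links in [("intersection", intersection), ("union", union)]:
    pairs = sorted((e_words[i], f_words[j]) for i, j in links)
    print(name, pairs)
```

Heuristics "between intersection and union" start from the intersection and add selected links from the union.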

Page 55: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignment Template System

• Word-align training corpus

• Extract phrase pairs

• Assign probabilities to phrase pairs

• Train language model

• Decode

Page 56: Statistical Machine Translation: IBM Models and the Alignment Template System

Extract phrase pairs:

• Extract all phrase pairs (E,F) consistent with word alignments, where consistency is defined as follows:

• (1) Each word in English phrase is aligned only with words in the foreign phrase

• (2) Each word in foreign phrase is aligned only with words in the English phrase

• Phrase pairs must consist of contiguous words in each language

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]
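The consistency check can be sketched as a brute-force search over contiguous spans. The alignment links below are an assumption: the original matrix is not available here, so the links were chosen to be consistent with the answer and the four phrase pairs listed on the following slides. `extract_phrase_pairs` is a hypothetical name, and the sketch additionally requires at least one link inside each extracted pair.

```python
def extract_phrase_pairs(e_words, f_words, links):
    """Return all contiguous phrase pairs consistent with the alignment links.

    links: set of (english_index, foreign_index) pairs.
    """
    pairs = []
    for e1 in range(len(e_words)):
        for e2 in range(e1, len(e_words)):
            for f1 in range(len(f_words)):
                for f2 in range(f1, len(f_words)):
                    inside = [
                        (i, j) for i, j in links
                        if e1 <= i <= e2 and f1 <= j <= f2
                    ]
                    # A crossing link has exactly one end inside the pair.
                    crossing = [
                        (i, j) for i, j in links
                        if (e1 <= i <= e2) != (f1 <= j <= f2)
                    ]
                    if inside and not crossing:
                        pairs.append((
                            " ".join(e_words[e1:e2 + 1]),
                            " ".join(f_words[f1:f2 + 1]),
                        ))
    return pairs

e_words = ["he", "does", "not", "go"]
f_words = ["il", "ne", "va", "pas"]
# Assumed alignment: he-il, does-ne, does-pas, not-ne, not-pas, go-va.
links = {(0, 0), (1, 1), (1, 3), (2, 1), (2, 3), (3, 2)}

phrase_pairs = extract_phrase_pairs(e_words, f_words, links)
for pair in phrase_pairs:
    print(pair)
```

The four nested loops are fine for a four-word sentence; practical extractors bound the phrase length and compute the foreign span from the links instead of trying every span.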

Page 57: Statistical Machine Translation: IBM Models and the Alignment Template System

Extract phrase pairs:

• Question: why is the illustrated phrase pair inconsistent with the alignment matrix?

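[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”, with one phrase pair highlighted]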

Page 58: Statistical Machine Translation: IBM Models and the Alignment Template System

Extract phrase pairs:

• Question: why is the illustrated phrase pair inconsistent with the alignment matrix?

• Answer: “ne” is aligned with “not”, which is outside the phrase pair; also, “does” is aligned with “pas”, which is outside the phrase pair

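[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”, with one phrase pair highlighted]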

Page 59: Statistical Machine Translation: IBM Models and the Alignment Template System

Extract phrase pairs:

<he, il>

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]

Page 60: Statistical Machine Translation: IBM Models and the Alignment Template System

Extract phrase pairs:

<he, il>

<go, va>

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]

Page 61: Statistical Machine Translation: IBM Models and the Alignment Template System

Extract phrase pairs:

<he, il>

<go, va>

<does not go, ne va pas>

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]

Page 62: Statistical Machine Translation: IBM Models and the Alignment Template System

Extract phrase pairs:

<he, il>

<go, va>

<does not go, ne va pas>

<he does not go, il ne va pas>

[Figure: word alignment matrix with columns “il ne va pas” and rows “he”, “does”, “not”, “go”]

Page 63: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignment Template System

• Word-align training corpus

• Extract phrase pairs

• Assign probabilities to phrase pairs

• Train language model

• Decode

Page 64: Statistical Machine Translation: IBM Models and the Alignment Template System

Probability Assignment

• Use relative frequency estimation:

$$P(F, E, A \mid F) = \frac{\mathrm{Count}(F, E, A)}{\sum_{E', A'} \mathrm{Count}(F, E', A')}$$
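Relative frequency estimation amounts to dividing each template count by the total count of templates sharing the same foreign side. The observations below are invented for illustration, and `template_probabilities` is a hypothetical name; each template is an (F, E, A) triple with A stored as a tuple of (English index, foreign index) links.

```python
from collections import Counter

def template_probabilities(observed):
    """observed: list of (F, E, A) triples extracted from the aligned corpus.

    Returns P[(F, E, A)] = Count(F, E, A) / sum over (E', A') of Count(F, E', A').
    """
    joint = Counter(observed)
    per_foreign = Counter(F for F, _, _ in observed)
    return {t: c / per_foreign[t[0]] for t, c in joint.items()}

# Hypothetical extraction results (invented counts).
a1 = ((0, 0), (0, 2), (1, 0), (1, 2), (2, 1))
a2 = ((0, 0), (1, 2), (2, 1))
observed = (
    [("ne va pas", "does not go", a1)] * 3
    + [("ne va pas", "is not going", a2)]
    + [("il", "he", ((0, 0),))] * 2
)

P = template_probabilities(observed)
print(P[("ne va pas", "does not go", a1)])
```

The probabilities of all templates with the same foreign side F sum to one.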

Page 65: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignment Template System

• Word-align training corpus

• Extract phrase pairs

• Assign probabilities to phrase pairs

• Train language model

• Decode

Page 66: Statistical Machine Translation: IBM Models and the Alignment Template System

Language Model

• Use N-gram language model P(e), just as for word-based MT

Page 67: Statistical Machine Translation: IBM Models and the Alignment Template System

Alignment Template System

• Word-align training corpus

• Extract phrase pairs

• Assign probabilities to phrase pairs

• Train language model

• Decode

Page 68: Statistical Machine Translation: IBM Models and the Alignment Template System

Decode

• Beam search
  • State space:
    • set of possible partial translation hypotheses
  • Start state:
    • initial empty translation of foreign input
  • Expansion operation:
    • extend existing English hypothesis one phrase at a time, by translating a phrase in foreign sentence into English

Page 69: Statistical Machine Translation: IBM Models and the Alignment Template System

Decoder Example

• Start:
  • f: “Maria no dio una bofetada a la bruja verde”
  • e: “”
• Expand English translation:
  • translate “Maria” -> “Mary” or “bruja” -> “witch”
  • mark foreign words as covered
  • update probabilities
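A heavily simplified beam-search decoder along these lines is sketched below. It is not the Pharaoh or alignment-template decoder: the phrase table and its probabilities are invented, the score uses only phrase translation probabilities plus a simple linear distortion penalty (an assumption made here so that the toy example has a unique best hypothesis), and there is no language model or hypothesis recombination. `beam_decode` is a hypothetical name.

```python
import math

def beam_decode(f_words, phrase_table, beam_size=5, distortion_weight=1.0):
    """Translate f_words by extending hypotheses one phrase at a time."""
    n = len(f_words)
    # stacks[k] holds hypotheses covering k foreign words:
    # (score, covered positions, English phrases so far, end of last source phrase)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, frozenset(), (), 0))
    for k in range(n):
        # Prune: keep only the beam_size best hypotheses in this stack.
        stacks[k] = sorted(stacks[k], key=lambda h: h[0], reverse=True)[:beam_size]
        for score, covered, english, last_end in stacks[k]:
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span = frozenset(range(i, j))
                    if span & covered:
                        continue  # some word in this span is already translated
                    source = " ".join(f_words[i:j])
                    for target, prob in phrase_table.get(source, []):
                        new_score = (score + math.log(prob)
                                     - distortion_weight * abs(i - last_end))
                        stacks[k + len(span)].append(
                            (new_score, covered | span, english + (target,), j)
                        )
    if not stacks[n]:
        return None
    return " ".join(max(stacks[n], key=lambda h: h[0])[2])

# Hypothetical phrase table: source phrase -> [(target phrase, probability)].
phrase_table = {
    "Maria": [("Mary", 0.9)],
    "no": [("did not", 0.7)],
    "dio una bofetada": [("slap", 0.8)],
    "a la": [("the", 0.6)],
    "bruja verde": [("green witch", 0.9)],
    "bruja": [("witch", 0.9)],
    "verde": [("green", 0.9)],
}
f = "Maria no dio una bofetada a la bruja verde".split()
translation = beam_decode(f, phrase_table)
print(translation)
```

Hypotheses are grouped into stacks by the number of foreign words covered, so pruning compares hypotheses that have translated the same amount of input.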

Page 70: Statistical Machine Translation: IBM Models and the Alignment Template System

Decoder Example

Example from [Koehn 2003]

Page 71: Statistical Machine Translation: IBM Models and the Alignment Template System

BLEU MT Evaluation Metric

• BLEU measures n-gram precision against a set of k reference English translations:
  • What percentage of n-grams (where n ranges from 1 through 4, typically) in the MT English output are also found in a reference translation?
  • Brevity penalty: penalize English translations with fewer words than the reference translations

• Why is this metric so widely used?
  • Correlates surprisingly well with human judgment of machine-generated translations
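The two ingredients, clipped n-gram precision and the brevity penalty, can be sketched for a single sentence as follows. This is a simplified illustration rather than a reference implementation: real BLEU is computed over a whole test corpus, and this version returns 0 as soon as any n-gram order has no match. The function names and the `max_n` parameter are assumptions.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with clipped n-gram precision and brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # For each n-gram, the maximum count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    # Reference length closest to the candidate length (ties -> shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "Maria did not slap the green witch".split()
print(bleu(reference, [reference]))
print(bleu("Maria did not slap the".split(), [reference]))
```

The second call shows the brevity penalty at work: every n-gram of the short output appears in the reference, yet the score is below 1 because the output is shorter than the reference.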

Page 72: Statistical Machine Translation: IBM Models and the Alignment Template System

References

• Brown et al. 1990. “A statistical approach to Machine Translation”.
• Brown et al. 1993. “The mathematics of statistical machine translation”.
• Collins 2003. “Lecture Notes from 6.891 Fall 2003: Machine Learning Approaches for Natural Language Processing”.
• Knight 1999. “A Statistical MT Workbook”.
• Knight and Koehn 2004. “A Statistical Machine Translation Tutorial”.
• Koehn, Och and Marcu 2003. “A Phrase-Based Statistical Machine Translation System”.
• Koehn 2003. “Pharaoh: A Phrase-Based Decoder”.
• Och and Ney 2004. “The Alignment Template System”.
• Och and Ney 2003. “Discriminative Training and Maximum Entropy Models for Statistical Machine Translation”.