TRANSCRIPT

Introduction to SMT – Word-based Translation Models
Jan Niehues - Lehrstuhl Prof. Alex Waibel
Institut für Anthropomatik
14.02.2014
Overview
Introduction
Lexica
Alignment
IBM Model 1
EM Algorithm
Higher IBM Models
Word Alignment
Introduction
Notation:

Source:
- f_i: source (foreign) word
- I: length of the source (foreign) sentence
- i: position in the source sentence
- f = f_1^I = f_1 ... f_i ... f_I: source sentence

Target:
- e_j: target (English) word
- J: length of the English sentence
- j: position in the English sentence
- e = e_1^J = e_1 ... e_j ... e_J: English sentence
Introduction
Statistical Machine Translation: find the most probable translation e for a given source sentence f:

\hat{e} = \arg\max_e p(e \mid f)

Using Bayes' rule:

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e \frac{p(f \mid e)\, p(e)}{p(f)} = \arg\max_e p(f \mid e)\, p(e)
System overview
Word-based Translation Model
Word-based models were introduced by Brown et al. in the early 1990s
Directly translate source words to target words
Model word-by-word translation probabilities
First statistical approach to machine translation
No longer state of the art
Still used to generate the word alignment for phrase extraction in phrase-based models
Lexica
Store translations of the source words
One word can have several translations
Example: Haus – house, building, home, household, shell
Some are more likely, others are only used in certain circumstances
How to decide which one to use in the translation?
Use statistics
Lexica
Translations   Counts   Probability
House          8000     0.8
Building       1600     0.16
Home           200      0.02
Household      150      0.015
Shell          50       0.005

• Collect counts of the different translations
• Approximate the probability distribution p_f : e → p_f(e)
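The relative-frequency estimate above can be sketched in a few lines of Python, using the counts from the "Haus" table:

```python
# Estimate the lexicon distribution p_f(e) by relative frequencies,
# using the translation counts for the source word "Haus".
counts = {"house": 8000, "building": 1600, "home": 200,
          "household": 150, "shell": 50}

total = sum(counts.values())                 # 10000 observations in total
p = {e: c / total for e, c in counts.items()}

print(p["house"])      # 0.8
print(p["household"])  # 0.015
```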
Alignment
Mapping between source and target words that are translations of each other
Example: Input:
“das Haus ist klein”
Probabilistic Lexicon
Possible word-by-word translation: "the house is small"
Implicit alignment between source and target sentence
Alignment
Formalized as a function:
Maps each target word position to a source word position:

a : j → i

Example:

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
Alignment Difficulties
Word reordering:
Leads to a non-monotonic alignment
Alignment Difficulties
Many-to-one alignments:
One word of the input language is translated into several words

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}
Alignment Difficulties
Deletion:
For some source words there is no equivalent in the translation

a : {1 → 1, 2 → 2, 3 → 3, 4 → 5}
Alignment Difficulties
Insertion:
Some words of the target sentence have no equivalent in the source sentence
Add a NULL word so that the alignment function is still fully defined

a : {1 → 1, 2 → 0, 3 → 4, 4 → 2, 5 → 5, 6 → 5, 7 → 6}
Alignment Remarks
Many-to-one alignments are possible, but no one-to-many alignments
In these models alignments are represented by a function
Leads to problems with language pairs like Chinese-English
In phrase-based systems this is solved by looking at the translation process from both directions
IBM Model 1
Model that generates different translations for a sentence with associated probabilities
Generative model: break the modeling of sentence translation into smaller steps of word-to-word translations with a coherent story
Probability of the English sentence e and alignment a given the foreign sentence f:

p(e, a \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)})

Number of possible alignments: (l_f + 1)^{l_e}
Normalization constant: \epsilon
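The IBM Model 1 score is a product of word translation probabilities times the alignment normalizer. A minimal sketch in Python; the t-table entries are made-up illustrative values, not trained probabilities:

```python
# IBM Model 1 score: p(e, a | f) = eps / (l_f + 1)^l_e * prod_j t(e_j | f_a(j)).
# The t-table values below are illustrative, not trained probabilities.
t = {("the", "das"): 0.8, ("house", "haus"): 0.8,
     ("is", "ist"): 0.7, ("small", "klein"): 0.5}

def ibm1_score(e, f, a, t, eps=1.0):
    """e, f: token lists; a[j] = source position for target position j (0 = NULL)."""
    f_null = ["NULL"] + f                     # NULL word at source position 0
    prob = eps / (len(f) + 1) ** len(e)       # alignment normalizer
    for j, ej in enumerate(e):
        prob *= t.get((ej, f_null[a[j]]), 0.0)
    return prob

f = ["das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
a = [1, 2, 3, 4]                              # monotone alignment
print(ibm1_score(e, f, a, t))
```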
IBM 1 Example
IBM 1 Training
Learn translation probability distributions
Problem: incomplete data
Only large amounts of sentence-aligned parallel text are available
The alignment information is missing
Consider alignment as a hidden variable
Approach: Expectation maximization (EM) algorithm
EM Algorithm
1. Initialize the model
   Use a uniform distribution
2. Apply the model to the data (expectation step)
   Compute alignment probabilities
   At first all are equal, but later "Hause" will most likely be translated to "house"
3. Learn the model from the data (maximization step)
   Learn translation probabilities from the guessed alignments
   Use the best alignment, or all alignments weighted according to their probability
4. Iterate steps 2 and 3 until convergence
Step 2
Calculate the probability of an alignment given the sentence pair:

p(a \mid e, f) = \frac{p(e, a \mid f)}{p(e \mid f)} \quad \text{with} \quad p(e \mid f) = \sum_a p(e, a \mid f)

Using dynamic programming (swapping product and sum), the complexity of computing p(e | f) drops from exponential to quadratic in the sentence length:

p(e \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i)
Step 2
Putting both equations together:

p(a \mid e, f) = \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j \mid f_i)}
Step 3
Collect counts from every sentence pair (e, f):

c(e \mid f; e, f) = \frac{t(e \mid f)}{\sum_{i=0}^{l_f} t(e \mid f_i)} \sum_{j=1}^{l_e} \delta(e, e_j) \sum_{i=0}^{l_f} \delta(f, f_i)

Calculate the translation probabilities:

t(e \mid f) = \frac{\sum_{(e,f)} c(e \mid f; e, f)}{\sum_{e} \sum_{(e,f)} c(e \mid f; e, f)}
Pseudo-code
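The slide's pseudo-code is not reproduced in the transcript; a minimal runnable sketch of IBM Model 1 EM training (uniform initialization, expectation, maximization; NULL alignment omitted for brevity) might look like this:

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """corpus: list of (f_tokens, e_tokens) sentence pairs. Returns t(e|f)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    e_vocab = {e for _, es in corpus for e in es}
    # 1. Initialize uniformly
    t = {(e, f): 1.0 / len(e_vocab) for f in f_vocab for e in e_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e|f)
        total = defaultdict(float)   # normalizer per source word f
        # 2. Expectation step: collect fractional counts
        for fs, es in corpus:
            for e in es:
                z = sum(t[(e, f)] for f in fs)   # sum_i t(e | f_i)
                for f in fs:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # 3. Maximization step: re-estimate t(e|f)
        for (e, f) in t:
            t[(e, f)] = count[(e, f)] / total[f] if total[f] > 0 else 0.0
    return t

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]
t = train_ibm1(corpus)
```

After a few iterations the co-occurrence statistics disambiguate the lexicon: "das" moves toward "the" and "buch" toward "book".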
Example
Convergence
Goal: find the model that best fits the data
Measure: how well does it translate unseen sentences?
At this point there is no test data
Instead: how well does it model the training data?
Convergence
Initial Model:
First iteration:
Final:
Probability of training sentences increases
Convergence
• Perplexity of the model:

\log_2 PP = -\sum_s \log_2 p(e_s \mid f_s)

• The perplexity is guaranteed to decrease or stay the same at each iteration
• EM converges to a local minimum
• IBM1: global minimum
Higher IBM Models
• IBM1 is very simple
• No treatment of reordering or of adding and dropping words
• Five models of increasing complexity were proposed by Brown et al.:

IBM Model 1   lexical translations
IBM Model 2   adds absolute alignment positions
IBM Model 3   adds fertility model
IBM Model 4   relative alignment positions
IBM Model 5   fixes deficiency
HMM           lexicon plus relative positions
Higher IBM Models
• Complexity of training grows, but the general principle stays the same
• During training:
– First train IBM Model 1
– Use IBM Model 1 to initialize IBM Model 2
– …
• All models are implemented in the GIZA++ toolkit
• Used by many groups
• Parallel version developed at CMU
IBM Model 2
• Problem of IBM Model 1: it assigns the same probability to both of these sentence pairs
• Model the alignment based on the positions of the input and output words
IBM Model 2
• Two step procedure:
• Mathematical formulation:
IBM Model 2
• Lexical translation step:
• Mathematical formulation:
t(of \mid natürlich) \cdot t(course \mid natürlich) \cdot t(is \mid ist) \cdot t(the \mid das) \cdot t(house \mid haus) \cdot t(small \mid klein)
= 0.5 \cdot 0.6 \cdot 0.7 \cdot 0.8 \cdot 0.8 \cdot 0.5
= 0.0672
IBM Model 2
• Alignment step:
• Mathematical formulation in IBM1:
a(1 \mid 1,6,5) \cdot a(1 \mid 2,6,5) \cdot a(3 \mid 3,6,5) \cdot a(4 \mid 4,6,5) \cdot a(2 \mid 5,6,5) \cdot a(5 \mid 6,6,5)
= \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6}
= 2.14 \cdot 10^{-5}
IBM Model 2
• Alignment step:
• Mathematical formulation in IBM2:
a(1 \mid 1,6,5) \cdot a(1 \mid 2,6,5) \cdot a(3 \mid 3,6,5) \cdot a(4 \mid 4,6,5) \cdot a(2 \mid 5,6,5) \cdot a(5 \mid 6,6,5)
= \tfrac{1}{3} \cdot \tfrac{1}{4} \cdot \tfrac{1}{3} \cdot \tfrac{1}{3} \cdot \tfrac{1}{10} \cdot \tfrac{1}{2}
= 4.6296 \cdot 10^{-4}
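Combining the lexical translation step and the alignment step gives the IBM Model 2 sentence score. A minimal sketch; both probability tables hold illustrative values, not trained ones:

```python
# IBM Model 2 score: p(e, a | f) = prod_j t(e_j | f_a(j)) * a(a(j) | j, l_e, l_f).
# Both tables below hold illustrative values, not trained probabilities.
t = {("the", "das"): 0.8, ("house", "haus"): 0.8,
     ("is", "ist"): 0.7, ("small", "klein"): 0.5}
align = {(1, 1): 0.6, (2, 2): 0.5, (3, 3): 0.5, (4, 4): 0.4}  # (i, j) -> a(i | j, 4, 4)

def ibm2_score(e, f, a, t, align):
    """a[j] = source position (1-based) aligned to target position j+1."""
    prob = 1.0
    for j, ej in enumerate(e):
        i = a[j]
        prob *= t[(ej, f[i - 1])] * align[(i, j + 1)]
    return prob

f = ["das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
print(ibm2_score(e, f, [1, 2, 3, 4], t, align))
```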
IBM Model 2
• Training:
  – Similar to IBM Model 1 training
• Initialization:
  – Initialize with the values from IBM Model 1 training
  – Alignment probability:

a(i \mid j, l_e, l_f) = \frac{1}{l_f + 1}
IBM Model 3
• So far we did not model how many words are generated by an input word
• Model fertility by a probability distribution: n(\Phi \mid f)
• Examples:
• Add an additional step to the generative model
IBM Model 3
• Word deletion:
  – Modelled by fertility 0
• Word insertion:
  – Could be modelled by the fertility of the NULL word: n(\Phi \mid NULL)
  – But this fertility should depend on the sentence length
  – Instead add a NULL insertion step
• NULL insertion step:
  – After every word, add a NULL token with probability p_1, or do not add one with probability p_0 = 1 - p_1
IBM Model 3
• Distortion model instead of alignment model:
  – Different distortions can produce the same alignment
  – The two models point in different directions:

a(i \mid j, l_e, l_f) \quad \text{vs.} \quad d(j \mid i, l_e, l_f)
IBM Model 3 Mathematical Formulation
• Fertility step
  – Fertility greater than one:
    • Different tableaus for the same alignment
    • All tableaus generating the same alignment have the same probability
    • Number of different tableaus generating the same alignment: \Phi_i!
  – Probability:

\prod_{i=1}^{l_f} \Phi_i! \; n(\Phi_i \mid f_i)
IBM Model 3 Mathematical Formulation
• Fertility step
1!\, n(1 \mid ich) \cdot 1!\, n(1 \mid gehe) \cdot 0!\, n(0 \mid ja) \cdot 1!\, n(1 \mid nicht) \cdot 2!\, n(2 \mid zum) \cdot 1!\, n(1 \mid haus)
= 0.9 \cdot 0.9 \cdot 0.4 \cdot 0.8 \cdot 2 \cdot 0.7 \cdot 0.8
= 0.290304
IBM Model 3 Mathematical Formulation
• NULL word insertion
  – Number of generated NULL words: \Phi_0
    • Depends on the number of output words generated from the input words
    • After each generated word a NULL word may be inserted
    • If s words are generated from foreign input words, s is the maximal number of generated NULL words
  – Probability:

\binom{l_e - \Phi_0}{\Phi_0} \, p_0^{\,l_e - 2\Phi_0} \, p_1^{\,\Phi_0}
IBM Model 3 Mathematical Formulation
• NULL Word insertion
\binom{7-1}{1} \cdot 0.1 \cdot 0.9^5
= 6 \cdot 0.1 \cdot 0.59049
= 0.354294
IBM Model 3 Mathematical Formulation
• Combine Fertility, lexical translation and distortion probabilities
p(e \mid f) = \sum_a p(e, a \mid f)
= \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \binom{l_e - \Phi_0}{\Phi_0} p_0^{\,l_e - 2\Phi_0} p_1^{\,\Phi_0} \prod_{i=1}^{l_f} \Phi_i! \, n(\Phi_i \mid f_i) \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)}) \, d(j \mid a(j), l_e, l_f)
IBM Model 3 Training
• Problem: exponential number of alignments
  – IBM1/2: dynamic programming
  – IBM3: no longer possible to use
• Sampling from the space of possible alignments
  – Find the most probable alignments
  – Add additional similar alignments
  – Use only these alignments for normalization
IBM Model 3 Training
• Finding the most probable alignment
  – Exponential number of alignments -> testing all possible alignments is too complex
  – Use a hill climbing algorithm:
    • Evaluate all points in the neighborhood
    • Go to the highest point
    • Iterate
• Problem: may end in a local maximum
  – Start at various locations
IBM Model 3 Training
• Initialization:
  – Exponential number of alignments -> testing all possible alignments is too complex
  – Use a hill climbing algorithm:
    • Evaluate all points in the neighborhood
    • Go to the highest point
    • Iterate
• Problem: may end in a local maximum
  – Start at various locations
  – Pegging
Pegging
• For all source indices i
  – For all target indices j
    • Set alignment a(j) = i
    • Find the most probable alignment under this condition
    • Add it to the set of starting points
Hill climbing
• Find the most probable alignment in the neighborhood
• Neighborhood:
  – Alignments that differ by a move:
    • Two alignments a1 and a2 differ by a move if they differ only in the alignment of one word j
  – Alignments that differ by a swap:
    • Two alignments a1 and a2 differ by a swap if they agree in the alignments of all words, except for two, for which the alignment points are switched
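The move/swap neighborhood can be sketched as a generator over alignment vectors (a minimal sketch; a[j] holds the source position for target word j, 0 meaning NULL):

```python
def neighbors(a, l_f):
    """Yield all alignments reachable from a by one move or one swap.
    a: list where a[j] is the source position (0..l_f) of target word j."""
    # Moves: change the alignment of a single target word j
    for j in range(len(a)):
        for i in range(l_f + 1):
            if i != a[j]:
                b = list(a)
                b[j] = i
                yield b
    # Swaps: exchange the alignment points of two target words
    for j1 in range(len(a)):
        for j2 in range(j1 + 1, len(a)):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]
                yield b

ns = list(neighbors([1, 2], l_f=2))
print(len(ns))   # 4 moves + 1 swap = 5 neighbors
```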
IBM3 Training
• Summary for IBM3 training– Sampling the alignments
• Pegging
– Collecting counts
– Estimating probabilities
IBM Model 4
• Distortion model:
  – Absolute positions in IBM Model 3: d(j \mid i, l_e, l_f)
  – Long sentences are relatively rare
  – The distortion probability cannot be approximated well
  – Use relative positions instead
  – Problem:
    • Added words
    • Dropped words
    • One-to-many alignments
IBM Model 4
• Cept:
  – Each input word f_i that is aligned to at least one output word forms a cept
  – Center of a cept:
    • Ceiling of the average of the output word positions
IBM Model 4
• Relative distortion:
  – Define a relative distortion for each output word:
  1. Target words generated by the NULL token:
     • Uniform distribution
  2. First word of a cept:
     • Word position j relative to the center of the preceding cept i-1
  3. Subsequent words in a cept:
     • Word position j relative to the position of the previous word in the cept
IBM Model 4
• Word classes:
  – Richer conditioning of the distortion:
    • Some words are reordered more often
    • E.g. adjectives when translating from English to French
  – Not sufficient statistics to estimate probabilities per word
    • Group words into word classes
    • Possible classes: POS tags
    • Originally: automatically clustered words
IBM Model 5
• Deficiency: according to IBM Model 3 and 4, multiple output words can be placed at the same position
  • Positive probability for impossible alignments
• IBM Model 5 prevents this
  – No longer multiple tableaus with the same alignment
  – Place words only into vacant word positions
  – For all word positions: how many untranslated words remain up to this word
• No improvement in alignment quality
• Not used in most state-of-the-art systems
HMM Alignment Model
• HMMs were successfully used in speech recognition
• Introduced by Vogel et al.
• Idea: use relative positions instead of absolute ones
  – Entire word groups (phrases) are moved with respect to the source position
• GIZA++ toolkit:
  – Replaces IBM2 by the HMM model
HMM Alignment Model
• First order model: the target position depends on the previous target position (captures movement of entire phrases):

Pr(a_j \mid a_1^{j-1}, e_0^I, J) = p(a_j \mid a_{j-1}, I, J)

• Alignment probability:

Pr(f_1^J \mid e_1^I) = p(J \mid I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \, p(f_j \mid e_{a_j})

• Maximum approximation:

Pr(f_1^J \mid e_1^I) \approx p(J \mid I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \, p(f_j \mid e_{a_j})
Viterbi Training
# Accumulation (over corpus): find the Viterbi path
For each sentence pair
  For each source position j
    For each target position i
      Pbest = 0
      t = p(f_j | e_i)
      For each target position i'
        Pprev = P(j-1, i')
        a = p(i | i', I, J)
        Pnew = Pprev * t * a
        if (Pnew > Pbest)
          Pbest = Pnew
          BackPointer(j, i) = i'
      P(j, i) = Pbest

# Update counts: follow the back pointers from the best final state
i = argmax_i { P(J, i) }
For each j from J downto 1
  Count(f_j, e_i)++
  iprev = BackPointer(j, i)
  Count(i, iprev, I, J)++
  i = iprev

# Renormalize
...
HMM Forward-Backward Training
• Gamma: probability to emit f_j from state i in sentence s
• Sum over all paths through (j, i):

\gamma_j^s(i) = \sum_{a_1^J :\, a_j = i} \prod_{j'=1}^{J} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}})
HMM Forward-Backward Training
• Epsilon: probability to transit from state i' into state i
• Sum over all paths through (j-1, i') and (j, i) emitting f_j:

\epsilon_j^s(i', i) = \sum_{a_1^J :\, a_{j-1} = i',\, a_j = i} \prod_{j'=1}^{J} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}})

(Figure: 11-731 Machine Translation, 2009)
Forward Probabilities
• Defined as:

\alpha_j(i) = \sum_{a_1^j :\, a_j = i} \prod_{j'=1}^{j} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}})

• Recursion:

\alpha_j(i) = \sum_{i'=1}^{I} \alpha_{j-1}(i') \, p(i \mid i', I) \, p(f_j \mid e_i)

• Initial condition:

\alpha_1(i) = p(i \mid 0, I) \, p(f_1 \mid e_i)
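The forward recursion can be sketched in a few lines of Python (a minimal sketch; the transition, emission, and initial tables are illustrative, not trained):

```python
# Forward probabilities alpha[j][i] for an HMM alignment model.
# trans[i'][i] ~ p(i | i', I); emit[j][i] ~ p(f_j | e_i); init[i] ~ p(i | 0, I).
def forward(emit, trans, init):
    """emit: J x I emission probs; trans: I x I transitions; init: length I."""
    J, I = len(emit), len(init)
    alpha = [[0.0] * I for _ in range(J)]
    for i in range(I):                       # alpha_1(i) = p(i|0,I) p(f_1|e_i)
        alpha[0][i] = init[i] * emit[0][i]
    for j in range(1, J):                    # alpha_j(i) = sum_i' alpha_{j-1}(i') p(i|i',I) p(f_j|e_i)
        for i in range(I):
            s = sum(alpha[j - 1][ip] * trans[ip][i] for ip in range(I))
            alpha[j][i] = s * emit[j][i]
    return alpha

emit = [[0.7, 0.1], [0.2, 0.6]]              # 2 source positions, 2 target states
trans = [[0.8, 0.2], [0.3, 0.7]]
init = [0.5, 0.5]
alpha = forward(emit, trans, init)
print(sum(alpha[-1]))                        # total sentence probability
```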
Backward Probabilities
• Defined as:

\beta_j(i) = \sum_{a_{j+1}^J} \prod_{j'=j+1}^{J} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}}) \quad \text{with } a_j = i

• Recursion:

\beta_j(i) = \sum_{i'=1}^{I} \beta_{j+1}(i') \, p(i' \mid i, I) \, p(f_{j+1} \mid e_{i'})

• Initial condition:

\beta_J(i) = 1
Forward-Backward
• Calculate gamma and epsilon from alpha and beta
  – Gamma:

\gamma_j(i) = \frac{\alpha_j(i) \, \beta_j(i)}{\sum_{i'=1}^{I} \alpha_j(i') \, \beta_j(i')}

  – Epsilon:

\epsilon_j(i', i) = \frac{\alpha_{j-1}(i') \, p(i \mid i', I) \, p(f_j \mid e_i) \, \beta_j(i)}{\sum_{\tilde{i}', \tilde{i}} \alpha_{j-1}(\tilde{i}') \, p(\tilde{i} \mid \tilde{i}', I) \, p(f_j \mid e_{\tilde{i}}) \, \beta_j(\tilde{i})}
Parameter Re-Estimation
• Lexicon probabilities:

p(f \mid e) = \frac{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \sum_{i:\, e_i^s = e,\; f_j^s = f} \gamma_j^s(i)}{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \sum_{i:\, e_i^s = e} \gamma_j^s(i)}

• Alignment probabilities:

p(i \mid i') = \frac{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \epsilon_j^s(i', i)}{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \gamma_j^s(i')}
Forward-Backward Training Pseudo Code
# Accumulation
For each sentence pair {
  Forward. (Calculate Alpha's)
  Backward. (Calculate Beta's)
  Calculate Epsilon's and Gamma's.
  For each source word {
    Increase LexiconCount(f_j | e_i) by Gamma(j, i).
    Increase AlignCount(i | i') by Epsilon(j, i, i').
  }
}

# Update
Normalize LexiconCount to get P(f_j | e_i).
Normalize AlignCount to get P(i | i').
Example HMM Training
IBM Models
• Phrase-based systems outperform these word-based translation models
• The IBM models can still be used to generate a word alignment via the Viterbi path
• Problem: no 1-to-many alignments
• But we can generate many-to-1 alignments
• Use alignments from both directions and combine them with a heuristic
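Combining the two alignment directions can be sketched as follows (intersection and union shown here; real systems often use a grow-diag heuristic in between):

```python
# Symmetrize two directional word alignments given as sets of (src, tgt) links.
def symmetrize(src2tgt, tgt2src):
    """src2tgt: links from the f->e run; tgt2src: links from the e->f run,
    already flipped into (src, tgt) order. Returns (intersection, union)."""
    inter = src2tgt & tgt2src    # high precision: both runs agree
    union = src2tgt | tgt2src    # high recall: either run proposes the link
    return inter, union

a1 = {(0, 0), (1, 1), (2, 2)}            # f->e direction
a2 = {(0, 0), (1, 1), (1, 2), (2, 2)}    # e->f direction, flipped
inter, union = symmetrize(a1, a2)
print(len(inter), len(union))            # 3 4
```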
Word alignment
Word alignment
• Evaluation:
  – Given some manually aligned data (ref) and automatically aligned data (hyp), links can be:
    • Correct, i.e. link in hyp matches link in ref: true positive (tp)
    • Wrong, i.e. link in hyp but not in ref: false positive (fp)
    • Missing, i.e. link in ref but not in hyp: false negative (fn)
Word alignment Measures
• Precision:
  – Number of correct links / number of links in hyp
  – Problem: fewer links -> higher precision

Precision = \frac{t_p}{t_p + f_p} = \frac{|A \cap R|}{|A|}

• Recall:
  – Number of correct links / number of links in the reference
  – Problem: putting all possible links in the alignment -> recall = 1

Recall = \frac{t_p}{t_p + f_n} = \frac{|A \cap R|}{|R|}
Word alignment Measures
• F-Score:

F\text{-}Score = \frac{2\, t_p}{2\, t_p + f_p + f_n} = \frac{2\, |A \cap R|}{|A| + |R|}

• Alignment error rate (AER):

AER = 1 - F\text{-}Score
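The measures above can be computed directly from the link sets (a minimal sketch; A is the hypothesis alignment, R the reference):

```python
# Precision, recall, F-score and AER over word alignment link sets.
def alignment_scores(A, R):
    tp = len(A & R)                       # correct links
    precision = tp / len(A)               # tp / (tp + fp)
    recall = tp / len(R)                  # tp / (tp + fn)
    f_score = 2 * tp / (len(A) + len(R))  # 2|A∩R| / (|A| + |R|)
    return precision, recall, f_score, 1.0 - f_score   # AER = 1 - F

A = {(1, 1), (2, 2), (3, 3), (4, 5)}      # hypothesis links
R = {(1, 1), (2, 2), (3, 3), (4, 4)}      # reference links
p, r, f, aer = alignment_scores(A, R)
print(p, r, f, aer)                       # 0.75 0.75 0.75 0.25
```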
Reference Alignments
• Sometimes it is difficult for human annotators to decide
• Differentiate between sure and possible links
• Sets:
  – A: generated links
  – S: sure links (not finding a sure link is an error)
  – P: possible links (putting a link which is not possible is an error)
  – Alignment error rate:

AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
Conclusion
• Word-based translation models
• Word alignment as hidden variable
• Only 1-to-n alignments possible