TRANSCRIPT

Introduction to SMT – Word-based Translation Models
Jan Niehues - Lehrstuhl Prof. Alex Waibel
Institut für Anthropomatik
14.02.2014
Overview
Introduction
Lexica
Alignment
IBM Model 1
EM Algorithm
Higher IBM Models
Word Alignment
Introduction
Notation:

Source:
- f_i: source (foreign) word
- I: length of the source (foreign) sentence
- i: position in the source sentence
- f = f_1^I = f_1 ... f_i ... f_I: source sentence

Target:
- e_j: target (English) word
- J: length of the English sentence
- j: position in the English sentence
- e = e_1^J = e_1 ... e_j ... e_J: English sentence
Introduction
Statistical Machine Translation: find the most probable translation e for a given source sentence f:

\hat{e} = \arg\max_e p(e \mid f)

Using Bayes' rule:

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e \frac{p(f \mid e)\, p(e)}{p(f)} = \arg\max_e p(f \mid e)\, p(e)
System overview
Word-based Translation Model
Word-based models were introduced by Brown et al. in the early 1990s
Directly translate source words to target words
Model word-by-word translation probabilities
First statistical approach to machine translation
No longer state of the art
Still used to generate the word alignment for phrase extraction in phrase-based models
Lexica
Store translations of the source words
One word can have several translations
Example: Haus – house, building, home, household, shell
Some are more likely, others are only used in certain circumstances
How to decide which one to use in the translation?
Use statistics
Lexica
Translations   Counts   Probability
House          8000     0.8
Building       1600     0.16
Home           200      0.02
Household      150      0.015
Shell          50       0.005

• Collect counts of the different translations
• Approximate the probability distribution p_f : e → p_f(e)
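The relative-frequency estimate above can be sketched in a few lines of Python, using the counts from the "Haus" table:

```python
# Estimate the lexicon distribution p_f(e) by relative frequencies,
# using the translation counts for the source word "Haus".
counts = {"house": 8000, "building": 1600, "home": 200,
          "household": 150, "shell": 50}

total = sum(counts.values())                 # 10000 observations in total
p = {e: c / total for e, c in counts.items()}

print(p["house"])      # 0.8
print(p["household"])  # 0.015
```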
Alignment
Mapping between source and target words that are translations of each other
Example: Input:
“das Haus ist klein”
Probabilistic Lexicon
Possible word-by-word translation: "the house is small"
Implicit alignment between source and target sentence
Alignment
Formalized as a function:
Maps each target word position to a source word position:

a : j → i

Example:

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
Alignment Difficulties
Word reordering:
Leads to a non-monotonic alignment
Alignment Difficulties
Many-to-one alignments:
One word of the input language is translated into several words

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}
Alignment Difficulties
Deletion:
For some source words there is no equivalent in the translation

a : {1 → 1, 2 → 2, 3 → 3, 4 → 5}
Alignment Difficulties
Insertion:
Some words of the target sentence have no equivalent in the source sentence
Add a NULL word so that the alignment function is still fully defined

a : {1 → 1, 2 → 0, 3 → 4, 4 → 2, 5 → 5, 6 → 5, 7 → 6}
Alignment Remarks
Many-to-one alignments are possible, but no one-to-many alignments
In these models alignments are represented by a function
Leads to problems with language pairs like Chinese-English
In phrase-based systems this is solved by looking at the translation process from both directions
IBM Model 1
Model that generates different translations for a sentence with associated probabilities
Generative model: break the modeling of sentence translation into smaller steps of word-to-word translations with a coherent story
Probability of the English sentence e and alignment a given the foreign sentence f:

p(e, a \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)})

Number of possible alignments: (l_f + 1)^{l_e}
Normalization constant: \epsilon
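The IBM Model 1 score is a product of word translation probabilities times the alignment normalizer. A minimal sketch in Python; the t-table entries are made-up illustrative values, not trained probabilities:

```python
# IBM Model 1 score: p(e, a | f) = eps / (l_f + 1)^l_e * prod_j t(e_j | f_a(j)).
# The t-table values below are illustrative, not trained probabilities.
t = {("the", "das"): 0.8, ("house", "haus"): 0.8,
     ("is", "ist"): 0.7, ("small", "klein"): 0.5}

def ibm1_score(e, f, a, t, eps=1.0):
    """e, f: token lists; a[j] = source position for target position j (0 = NULL)."""
    f_null = ["NULL"] + f                     # NULL word at source position 0
    prob = eps / (len(f) + 1) ** len(e)       # alignment normalizer
    for j, ej in enumerate(e):
        prob *= t.get((ej, f_null[a[j]]), 0.0)
    return prob

f = ["das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
a = [1, 2, 3, 4]                              # monotone alignment
print(ibm1_score(e, f, a, t))
```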
IBM 1 Example
IBM 1 Training
Learn translation probability distributions
Problem: incomplete data
Only large amounts of sentence-aligned parallel text are available
The alignment information is missing
Consider alignment as a hidden variable
Approach: Expectation maximization (EM) algorithm
EM Algorithm
1. Initialize the model
   Use a uniform distribution
2. Apply the model to the data (expectation step)
   Compute alignment probabilities
   At first all are equal, but later "Hause" will most likely be translated to "house"
3. Learn the model from the data (maximization step)
   Learn translation probabilities from the guessed alignments
   Use the best alignment, or all alignments weighted according to their probability
4. Iterate steps 2 and 3 until convergence
Step 2
Calculate the probability of an alignment given the sentence pair:

p(a \mid e, f) = \frac{p(e, a \mid f)}{p(e \mid f)} \quad \text{with} \quad p(e \mid f) = \sum_a p(e, a \mid f)

Using dynamic programming (swapping product and sum), the complexity of computing p(e | f) drops from exponential to quadratic in the sentence length:

p(e \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i)
Step 2
Putting both equations together:

p(a \mid e, f) = \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j \mid f_i)}
Step 3
Collect counts from every sentence pair (e, f):

c(e \mid f; e, f) = \frac{t(e \mid f)}{\sum_{i=0}^{l_f} t(e \mid f_i)} \sum_{j=1}^{l_e} \delta(e, e_j) \sum_{i=0}^{l_f} \delta(f, f_i)

Calculate the translation probabilities:

t(e \mid f) = \frac{\sum_{(e,f)} c(e \mid f; e, f)}{\sum_{e} \sum_{(e,f)} c(e \mid f; e, f)}
Pseudo-code
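The slide's pseudo-code is not reproduced in the transcript; a minimal runnable sketch of IBM Model 1 EM training (uniform initialization, expectation, maximization; NULL alignment omitted for brevity) might look like this:

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """corpus: list of (f_tokens, e_tokens) sentence pairs. Returns t(e|f)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    e_vocab = {e for _, es in corpus for e in es}
    # 1. Initialize uniformly
    t = {(e, f): 1.0 / len(e_vocab) for f in f_vocab for e in e_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e|f)
        total = defaultdict(float)   # normalizer per source word f
        # 2. Expectation step: collect fractional counts
        for fs, es in corpus:
            for e in es:
                z = sum(t[(e, f)] for f in fs)   # sum_i t(e | f_i)
                for f in fs:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # 3. Maximization step: re-estimate t(e|f)
        for (e, f) in t:
            t[(e, f)] = count[(e, f)] / total[f] if total[f] > 0 else 0.0
    return t

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]
t = train_ibm1(corpus)
```

After a few iterations the co-occurrence statistics disambiguate the lexicon: "das" moves toward "the" and "buch" toward "book".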
Example
Convergence
Goal: find the model that best fits the data
Measure: how well does it translate unseen sentences?
At this point there is no test data
Instead: how well does it model the training data?
Convergence
Initial Model:
First iteration:
Final:
Probability of training sentences increases
Convergence
• Perplexity of the model:

\log_2 PP = -\sum_s \log_2 p(e_s \mid f_s)

• The perplexity is guaranteed to decrease or stay the same at each iteration
• EM converges to a local minimum
• IBM1: global minimum
Higher IBM Models
• IBM1 is very simple
• No treatment of reordering or of adding and dropping words
• Five models of increasing complexity were proposed by Brown et al.:

IBM Model 1   lexical translations
IBM Model 2   adds absolute alignment positions
IBM Model 3   adds fertility model
IBM Model 4   relative alignment positions
IBM Model 5   fixes deficiency
HMM           lexicon plus relative positions
Higher IBM Models
• Complexity of training grows, but the general principle stays the same
• During training:
– First train IBM Model 1
– Use IBM Model 1 to initialize IBM Model 2
– …
• All models are implemented in the GIZA++ toolkit
• Used by many groups
• Parallel version developed at CMU
IBM Model 2
• Problem of IBM Model 1: it assigns the same probability to both of these sentence pairs
• Model the alignment based on the positions of the input and output words
IBM Model 2
• Two step procedure:
• Mathematical formulation:
IBM Model 2
• Lexical translation step:
• Mathematical formulation:
t(of \mid natürlich) \cdot t(course \mid natürlich) \cdot t(is \mid ist) \cdot t(the \mid das) \cdot t(house \mid haus) \cdot t(small \mid klein)
= 0.5 \cdot 0.6 \cdot 0.7 \cdot 0.8 \cdot 0.8 \cdot 0.5
= 0.0672
IBM Model 2
• Alignment step:
• Mathematical formulation in IBM1:
a(1 \mid 1,6,5) \cdot a(1 \mid 2,6,5) \cdot a(3 \mid 3,6,5) \cdot a(4 \mid 4,6,5) \cdot a(2 \mid 5,6,5) \cdot a(5 \mid 6,6,5)
= \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6} \cdot \tfrac{1}{6}
= 2.14 \cdot 10^{-5}
IBM Model 2
• Alignment step:
• Mathematical formulation in IBM2:
a(1 \mid 1,6,5) \cdot a(1 \mid 2,6,5) \cdot a(3 \mid 3,6,5) \cdot a(4 \mid 4,6,5) \cdot a(2 \mid 5,6,5) \cdot a(5 \mid 6,6,5)
= \tfrac{1}{3} \cdot \tfrac{1}{4} \cdot \tfrac{1}{3} \cdot \tfrac{1}{3} \cdot \tfrac{1}{10} \cdot \tfrac{1}{2}
= 4.6296 \cdot 10^{-4}
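Combining the lexical translation step and the alignment step gives the IBM Model 2 sentence score. A minimal sketch; both probability tables hold illustrative values, not trained ones:

```python
# IBM Model 2 score: p(e, a | f) = prod_j t(e_j | f_a(j)) * a(a(j) | j, l_e, l_f).
# Both tables below hold illustrative values, not trained probabilities.
t = {("the", "das"): 0.8, ("house", "haus"): 0.8,
     ("is", "ist"): 0.7, ("small", "klein"): 0.5}
align = {(1, 1): 0.6, (2, 2): 0.5, (3, 3): 0.5, (4, 4): 0.4}  # (i, j) -> a(i | j, 4, 4)

def ibm2_score(e, f, a, t, align):
    """a[j] = source position (1-based) aligned to target position j+1."""
    prob = 1.0
    for j, ej in enumerate(e):
        i = a[j]
        prob *= t[(ej, f[i - 1])] * align[(i, j + 1)]
    return prob

f = ["das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
print(ibm2_score(e, f, [1, 2, 3, 4], t, align))
```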
IBM Model 2
• Training:
  – Similar to IBM Model 1 training
• Initialization:
  – Initialize with the values from IBM Model 1 training
  – Alignment probability:

a(i \mid j, l_e, l_f) = \frac{1}{l_f + 1}
IBM Model 3
• So far we did not model how many words are generated by an input word
• Model fertility by a probability distribution: n(\Phi \mid f)
• Examples:
• Add an additional step to the generative model
IBM Model 3
• Word deletion:
  – Modelled by fertility 0
• Word insertion:
  – Could be modelled by the fertility of the NULL word: n(\Phi \mid NULL)
  – But this fertility should depend on the sentence length
  – Instead add a NULL insertion step
• NULL insertion step:
  – After every word, add a NULL token with probability p_1, or do not add one with probability p_0 = 1 - p_1
IBM Model 3
• Distortion model instead of alignment model:
  – Different distortions can produce the same alignment
  – The two models point in different directions:

a(i \mid j, l_e, l_f) \quad \text{vs.} \quad d(j \mid i, l_e, l_f)
IBM Model 3 Mathematical Formulation
• Fertility step
  – Fertility greater than one:
    • Different tableaus for the same alignment
    • All tableaus generating the same alignment have the same probability
    • Number of different tableaus generating the same alignment: \Phi_i!
  – Probability:

\prod_{i=1}^{l_f} \Phi_i! \; n(\Phi_i \mid f_i)
IBM Model 3 Mathematical Formulation
• Fertility step
1!\, n(1 \mid ich) \cdot 1!\, n(1 \mid gehe) \cdot 0!\, n(0 \mid ja) \cdot 1!\, n(1 \mid nicht) \cdot 2!\, n(2 \mid zum) \cdot 1!\, n(1 \mid haus)
= 0.9 \cdot 0.9 \cdot 0.4 \cdot 0.8 \cdot 2 \cdot 0.7 \cdot 0.8
= 0.290304
IBM Model 3 Mathematical Formulation
• NULL word insertion
  – Number of generated NULL words: \Phi_0
    • Depends on the number of output words generated from the input words
    • After each generated word a NULL word may be inserted
    • If s words are generated from foreign input words, s is the maximal number of generated NULL words
  – Probability:

\binom{l_e - \Phi_0}{\Phi_0} \, p_0^{\,l_e - 2\Phi_0} \, p_1^{\,\Phi_0}
IBM Model 3 Mathematical Formulation
• NULL Word insertion
\binom{7-1}{1} \cdot 0.1 \cdot 0.9^5
= 6 \cdot 0.1 \cdot 0.59049
= 0.354294
IBM Model 3 Mathematical Formulation
• Combine Fertility, lexical translation and distortion probabilities
p(e \mid f) = \sum_a p(e, a \mid f)
= \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \binom{l_e - \Phi_0}{\Phi_0} p_0^{\,l_e - 2\Phi_0} p_1^{\,\Phi_0} \prod_{i=1}^{l_f} \Phi_i! \, n(\Phi_i \mid f_i) \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)}) \, d(j \mid a(j), l_e, l_f)
IBM Model 3 Training
• Problem: exponential number of alignments
  – IBM1/2: dynamic programming
  – IBM3: no longer possible to use
• Sampling from the space of possible alignments
  – Find the most probable alignments
  – Add additional similar alignments
  – Use only these alignments for normalization
IBM Model 3 Training
• Finding the most probable alignment
  – Exponential number of alignments -> testing all possible alignments is too complex
  – Use a hill climbing algorithm:
    • Evaluate all points in the neighborhood
    • Go to the highest point
    • Iterate
• Problem: may end in a local maximum
  – Start at various locations
IBM Model 3 Training
• Initialization:
  – Exponential number of alignments -> testing all possible alignments is too complex
  – Use a hill climbing algorithm:
    • Evaluate all points in the neighborhood
    • Go to the highest point
    • Iterate
• Problem: may end in a local maximum
  – Start at various locations
  – Pegging
Pegging
• For all source indices i
  – For all target indices j
    • Set alignment a(j) = i
    • Find the most probable alignment under this condition
    • Add it to the set of starting points
Hill climbing
• Find the most probable alignment in the neighborhood
• Neighborhood:
  – Alignments that differ by a move:
    • Two alignments a1 and a2 differ by a move if they differ only in the alignment of one word j
  – Alignments that differ by a swap:
    • Two alignments a1 and a2 differ by a swap if they agree in the alignments of all words, except for two, for which the alignment points are switched
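The move/swap neighborhood can be sketched as a generator over alignment vectors (a minimal sketch; a[j] holds the source position for target word j, 0 meaning NULL):

```python
def neighbors(a, l_f):
    """Yield all alignments reachable from a by one move or one swap.
    a: list where a[j] is the source position (0..l_f) of target word j."""
    # Moves: change the alignment of a single target word j
    for j in range(len(a)):
        for i in range(l_f + 1):
            if i != a[j]:
                b = list(a)
                b[j] = i
                yield b
    # Swaps: exchange the alignment points of two target words
    for j1 in range(len(a)):
        for j2 in range(j1 + 1, len(a)):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]
                yield b

ns = list(neighbors([1, 2], l_f=2))
print(len(ns))   # 4 moves + 1 swap = 5 neighbors
```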
IBM3 Training
• Summary for IBM3 training– Sampling the alignments
• Pegging
– Collecting counts
– Estimating probabilities
IBM Model 4
• Distortion model:
  – Absolute positions in IBM Model 3: d(j \mid i, l_e, l_f)
  – Long sentences are relatively rare
  – The distortion probability cannot be approximated well
  – Use relative positions instead
  – Problem:
    • Added words
    • Dropped words
    • One-to-many alignments
IBM Model 4
• Cept:
  – Each input word f_i that is aligned to at least one output word forms a cept
  – Center of a cept:
    • Ceiling of the average of the output word positions
IBM Model 4
• Relative distortion:
  – Define a relative distortion for each output word:
  1. Target words generated by the NULL token:
     • Uniform distribution
  2. First word of a cept:
     • Word position j relative to the center of the preceding cept i-1
  3. Subsequent words in a cept:
     • Word position j relative to the position of the previous word in the cept
IBM Model 4
• Word classes:
  – Richer conditioning of the distortion:
    • Some words are reordered more often
    • E.g. adjectives when translating from English to French
  – Not sufficient statistics to estimate probabilities per word
    • Group words into word classes
    • Possible classes: POS tags
    • Originally: automatically clustered words
IBM Model 5
• Deficiency: according to IBM Model 3 and 4, multiple output words can be placed at the same position
  • Positive probability for impossible alignments
• IBM Model 5 prevents this
  – No longer multiple tableaus with the same alignment
  – Place words only into vacant word positions
  – For all word positions: how many untranslated words remain up to this word
• No improvement in alignment quality
• Not used in most state-of-the-art systems
HMM Alignment Model
• HMMs were successfully used in speech recognition
• Introduced by Vogel et al.
• Idea: use relative positions instead of absolute ones
  – Entire word groups (phrases) are moved with respect to the source position
• GIZA++ toolkit:
  – Replaces IBM2 by the HMM model
HMM Alignment Model
• First order model: the target position depends on the previous target position (captures movement of entire phrases):

Pr(a_j \mid a_1^{j-1}, e_0^I, J) = p(a_j \mid a_{j-1}, I, J)

• Alignment probability:

Pr(f_1^J \mid e_1^I) = p(J \mid I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \, p(f_j \mid e_{a_j})

• Maximum approximation:

Pr(f_1^J \mid e_1^I) \approx p(J \mid I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \, p(f_j \mid e_{a_j})
Viterbi Training
# Accumulation (over corpus): find the Viterbi path
For each sentence pair
  For each source position j
    For each target position i
      Pbest = 0
      t = p(f_j | e_i)
      For each target position i'
        Pprev = P(j-1, i')
        a = p(i | i', I, J)
        Pnew = Pprev * t * a
        if (Pnew > Pbest)
          Pbest = Pnew
          BackPointer(j, i) = i'
      P(j, i) = Pbest

# Update counts: follow the back pointers from the best final state
i = argmax_i { P(J, i) }
For each j from J downto 1
  Count(f_j, e_i)++
  iprev = BackPointer(j, i)
  Count(i, iprev, I, J)++
  i = iprev

# Renormalize
...
HMM Forward-Backward Training
• Gamma: probability to emit f_j from state i in sentence s
• Sum over all paths through (j, i):

\gamma_j^s(i) = \sum_{a_1^J :\, a_j = i} \prod_{j'=1}^{J} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}})
HMM Forward-Backward Training
• Epsilon: probability to transit from state i' into state i
• Sum over all paths through (j-1, i') and (j, i) emitting f_j:

\epsilon_j^s(i', i) = \sum_{a_1^J :\, a_{j-1} = i',\, a_j = i} \prod_{j'=1}^{J} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}})

(Figure: 11-731 Machine Translation, 2009)
Forward Probabilities
• Defined as:

\alpha_j(i) = \sum_{a_1^j :\, a_j = i} \prod_{j'=1}^{j} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}})

• Recursion:

\alpha_j(i) = \sum_{i'=1}^{I} \alpha_{j-1}(i') \, p(i \mid i', I) \, p(f_j \mid e_i)

• Initial condition:

\alpha_1(i) = p(i \mid 0, I) \, p(f_1 \mid e_i)
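The forward recursion can be sketched in a few lines of Python (a minimal sketch; the transition, emission, and initial tables are illustrative, not trained):

```python
# Forward probabilities alpha[j][i] for an HMM alignment model.
# trans[i'][i] ~ p(i | i', I); emit[j][i] ~ p(f_j | e_i); init[i] ~ p(i | 0, I).
def forward(emit, trans, init):
    """emit: J x I emission probs; trans: I x I transitions; init: length I."""
    J, I = len(emit), len(init)
    alpha = [[0.0] * I for _ in range(J)]
    for i in range(I):                       # alpha_1(i) = p(i|0,I) p(f_1|e_i)
        alpha[0][i] = init[i] * emit[0][i]
    for j in range(1, J):                    # alpha_j(i) = sum_i' alpha_{j-1}(i') p(i|i',I) p(f_j|e_i)
        for i in range(I):
            s = sum(alpha[j - 1][ip] * trans[ip][i] for ip in range(I))
            alpha[j][i] = s * emit[j][i]
    return alpha

emit = [[0.7, 0.1], [0.2, 0.6]]              # 2 source positions, 2 target states
trans = [[0.8, 0.2], [0.3, 0.7]]
init = [0.5, 0.5]
alpha = forward(emit, trans, init)
print(sum(alpha[-1]))                        # total sentence probability
```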
Backward Probabilities
• Defined as:

\beta_j(i) = \sum_{a_{j+1}^J} \prod_{j'=j+1}^{J} p(a_{j'} \mid a_{j'-1}, I) \, p(f_{j'} \mid e_{a_{j'}}) \quad \text{with } a_j = i

• Recursion:

\beta_j(i) = \sum_{i'=1}^{I} \beta_{j+1}(i') \, p(i' \mid i, I) \, p(f_{j+1} \mid e_{i'})

• Initial condition:

\beta_J(i) = 1
Forward-Backward
• Calculate gamma and epsilon from alpha and beta
  – Gamma:

\gamma_j(i) = \frac{\alpha_j(i) \, \beta_j(i)}{\sum_{i'=1}^{I} \alpha_j(i') \, \beta_j(i')}

  – Epsilon:

\epsilon_j(i', i) = \frac{\alpha_{j-1}(i') \, p(i \mid i', I) \, p(f_j \mid e_i) \, \beta_j(i)}{\sum_{\tilde{i}', \tilde{i}} \alpha_{j-1}(\tilde{i}') \, p(\tilde{i} \mid \tilde{i}', I) \, p(f_j \mid e_{\tilde{i}}) \, \beta_j(\tilde{i})}
Parameter Re-Estimation
• Lexicon probabilities:

p(f \mid e) = \frac{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \sum_{i:\, e_i^s = e,\; f_j^s = f} \gamma_j^s(i)}{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \sum_{i:\, e_i^s = e} \gamma_j^s(i)}

• Alignment probabilities:

p(i \mid i') = \frac{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \epsilon_j^s(i', i)}{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \gamma_j^s(i')}
Forward-Backward Training Pseudo Code
# Accumulation
For each sentence pair {
  Forward. (Calculate Alpha's)
  Backward. (Calculate Beta's)
  Calculate Epsilon's and Gamma's.
  For each source word {
    Increase LexiconCount(f_j | e_i) by Gamma(j, i).
    Increase AlignCount(i | i') by Epsilon(j, i, i').
  }
}

# Update
Normalize LexiconCount to get P(f_j | e_i).
Normalize AlignCount to get P(i | i').
Example HMM Training
IBM Models
• Phrase-based systems outperform these word-based translation models
• The IBM models can still be used to generate a word alignment via the Viterbi path
• Problem: no 1-to-many alignments
• But we can generate many-to-1 alignments
• Use alignments from both directions and combine them with a heuristic
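Combining the two alignment directions can be sketched as follows (intersection and union shown here; real systems often use a grow-diag heuristic in between):

```python
# Symmetrize two directional word alignments given as sets of (src, tgt) links.
def symmetrize(src2tgt, tgt2src):
    """src2tgt: links from the f->e run; tgt2src: links from the e->f run,
    already flipped into (src, tgt) order. Returns (intersection, union)."""
    inter = src2tgt & tgt2src    # high precision: both runs agree
    union = src2tgt | tgt2src    # high recall: either run proposes the link
    return inter, union

a1 = {(0, 0), (1, 1), (2, 2)}            # f->e direction
a2 = {(0, 0), (1, 1), (1, 2), (2, 2)}    # e->f direction, flipped
inter, union = symmetrize(a1, a2)
print(len(inter), len(union))            # 3 4
```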
Word alignment
Word alignment
• Evaluation:
  – Given some manually aligned data (ref) and automatically aligned data (hyp), links can be:
    • Correct, i.e. link in hyp matches link in ref: true positive (tp)
    • Wrong, i.e. link in hyp but not in ref: false positive (fp)
    • Missing, i.e. link in ref but not in hyp: false negative (fn)
Word alignment Measures
• Precision:
  – Number of correct links / number of links in hyp
  – Problem: fewer links -> higher precision

Precision = \frac{t_p}{t_p + f_p} = \frac{|A \cap R|}{|A|}

• Recall:
  – Number of correct links / number of links in the reference
  – Problem: putting all possible links in the alignment -> recall = 1

Recall = \frac{t_p}{t_p + f_n} = \frac{|A \cap R|}{|R|}
Word alignment Measures
• F-Score:

F\text{-}Score = \frac{2\, t_p}{2\, t_p + f_p + f_n} = \frac{2\, |A \cap R|}{|A| + |R|}

• Alignment error rate (AER):

AER = 1 - F\text{-}Score
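The measures above can be computed directly from the link sets (a minimal sketch; A is the hypothesis alignment, R the reference):

```python
# Precision, recall, F-score and AER over word alignment link sets.
def alignment_scores(A, R):
    tp = len(A & R)                       # correct links
    precision = tp / len(A)               # tp / (tp + fp)
    recall = tp / len(R)                  # tp / (tp + fn)
    f_score = 2 * tp / (len(A) + len(R))  # 2|A∩R| / (|A| + |R|)
    return precision, recall, f_score, 1.0 - f_score   # AER = 1 - F

A = {(1, 1), (2, 2), (3, 3), (4, 5)}      # hypothesis links
R = {(1, 1), (2, 2), (3, 3), (4, 4)}      # reference links
p, r, f, aer = alignment_scores(A, R)
print(p, r, f, aer)                       # 0.75 0.75 0.75 0.25
```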
Reference Alignments
• Sometimes it is difficult for human annotators to decide
• Differentiate between sure and possible links
• Sets:
  – A: generated links
  – S: sure links (not finding a sure link is an error)
  – P: possible links (putting a link which is not possible is an error)
  – Alignment error rate:

AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
Conclusion
• Word-based translation models
• Word alignment as hidden variable
• Only 1-to-n alignments possible