Decoding in SMT and more

Sara Stymne
September 24, 2018
Partly based on slides by Christian Hardmeier and Jörg Tiedemann
Road Map

Decoding Phrase-Based SMT Models
• iterative hypothesis expansion
• pruning
• beam search using stack decoding

Factored translation
Document-level translation
Translating with SMT Models
SMT Model
• statistical model of the translation process
• various components
• millions and billions of parameters and options

SMT Decoder
• implements the search for a good translation candidate
• tries to find the translation hypothesis with the highest model score (but may return sub-optimal solutions)
• important features: accuracy and efficiency
Decoding Phrase-Based SMT

Search problem:
• Find the best translation among all possible translation candidates

The Log-Linear Model: Decoding and Tuning in Statistical Machine Translation
Christian Hardmeier, 2015-05-20
Decoding

The decoder is the part of the SMT system that creates the translations.
Given a set of models, how can we translate efficiently and accurately?

Find the best translation among all possible translations:

t* = arg max_t f(s, t) = arg max_t Σ_i λ_i h_i(s, t)

f(s, t)    scoring function
h_i(s, t)  feature functions
λ_i        feature weights
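As a toy illustration of this formula (the candidates, feature values, and weights below are invented for the example), the log-linear score is just a weighted sum of feature values, and decoding picks the candidate that maximises it:

```python
from math import log

def loglinear_score(weights, features):
    # f(s, t) = sum_i lambda_i * h_i(s, t)
    return sum(w * h for w, h in zip(weights, features))

# Hypothetical candidates with two feature scores each
# (log-probabilities from, say, a translation model and a language model).
candidates = {
    "behind the house found police": [log(0.4), log(0.01)],
    "behind the house police found": [log(0.3), log(0.05)],
}
weights = [1.0, 1.5]  # lambda_i; in practice set by weight tuning

best = max(candidates, key=lambda t: loglinear_score(weights, candidates[t]))
```

Note that the weights decide the trade-off: here the language model gets 1.5 times the influence of the translation model, which is enough to prefer the more fluent word order.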
Model Error vs. Search Error

Model Error
• The solution with the highest score according to our model is not an acceptable translation
• Model deficiency

Search Error
• The decoder fails to find the solution with the highest model score
• Shortcomings of the search algorithm
Phrase-based SMT: Generative Model

Bakom huset hittade polisen en stor mängd narkotika .
Behind the house found police a large quantity of narcotics .
Behind the house police found a large quantity of narcotics .

1. Phrase segmentation
2. Phrase translation
3. Output ordering

[Figure: sliding windows over the output — "Behind the house", "the house police", "house police found", "police found a", "found a large" — the context window of the n-gram language model]
Translation Options

Illustrations by Philipp Koehn

er geht ja nicht nach hause

[Figure: translation-option table — for each source span, candidate target phrases such as "he / it / , it / , he", "is / are / goes / go", "yes / , of course / is", "not / do not / does not / is not", "after / to / according to / in", "house / home / chamber / at home", plus multi-word options like "it is / he will be", "it goes / he goes", "is after all / does", "not to", "are not / is not a", "home / under house / return home"]

• Many translation options to choose from
  – in the Europarl phrase table: 2727 matching phrase pairs for this sentence
  – by pruning to the top 20 per phrase, 202 translation options remain

Decoding by Hypothesis Expansion

er geht ja nicht nach hause

[Figure: hypothesis-expansion lattice — partial hypotheses such as "he", "it", "are", "goes", "does not", "yes", "go", "to", "home" are expanded step by step; hypotheses are also created from previously created partial hypotheses]

Get (all) translation options for all possible segmentations of the input sentence to be translated!
Using the available translation options, create translation hypotheses from left to right:
• Is it always possible to translate any sentence in this way?
• What would cause the process to break down, so the decoder can't find a translation that covers the whole input sentence?
• How could you make sure that this never happens?
Decoding Complexity
Naively, in a sentence of N words with T translation options for each phrase, we can have
• O(2^N) phrase segmentations,
• O(T^N) sets of phrase translations,
• O(N!) word reordering permutations
Exploiting Model Locality

What we need to score a new hypothesis is
• the score of the previous hypothesis
• the translation model score
• the new language model score

Example:
Bakom huset hittade polisen en stor mängd narkotika .
Partial hypothesis: Behind the house police found a big
Hypothesis Recombination

Context-Independent Translation Model
• looks only at the current phrase

N-gram Language Model
• looks only at a window of N words

Decoder choices are independent of everything beyond this window!
• no need to reconsider alternative hypotheses once they have moved out of the n-gram history
Hypothesis Recombination

Example:
• three hypotheses with the same coverage
• trigram language model

After the house police     Score = −12.5
Behind the house police    Score = −11.2
, the house police         Score = −22.0

All three hypotheses end in the same two words, so with a trigram language model every future score is identical for all of them. We already know the winner! The competing hypotheses can be discarded, because they will never beat the winning one later on.
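The idea can be sketched as a recombination key (a simplification, not the Moses implementation): two partial hypotheses may be recombined when they agree on everything future scores can depend on — source coverage, the last n−1 target words, and the end position of the last source phrase (for the distortion model).

```python
def recombination_key(coverage, target_words, last_src_end, n=3):
    """coverage: frozenset of covered source positions;
    target_words: tuple of output words produced so far;
    n: order of the language model."""
    # Future LM scores only see the last n-1 words of the output.
    return (coverage, target_words[-(n - 1):], last_src_end)

h1 = (frozenset({0, 1, 2, 3}), ("after", "the", "house", "police"), 3)
h2 = (frozenset({0, 1, 2, 3}), ("behind", "the", "house", "police"), 3)

# Same key: only the higher-scoring of the two needs to be kept.
same = recombination_key(*h1) == recombination_key(*h2)
```

The key makes explicit why identical full outputs are not required: the words that differ ("after" vs. "behind") have already scrolled out of the trigram history.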
Hypothesis Recombination

In the search graph: combine branches

It is a form of dynamic programming
• substantial reduction of the search space
• search is still optimal
• but decoding is still exponential

From Koehn, Statistical Machine Translation, Chapter 6:

[Figure 6.3: Recombination example: two hypothesis paths lead to two matching hypotheses: they have the same number of foreign words translated and the same English words in the output, but different scores. In the heuristic search they can be recombined; the worse hypothesis is dropped.]

[Figure 6.4: More complex recombination example: in heuristic search with a trigram language model, two hypotheses can also be recombined here. Both hypotheses are the same with respect to subsequent costs: the language model still considers the last two words produced (does not), but not the third-last word. All other later costs are independent of the words produced so far.]

The worse hypothesis can never be part of the best overall path. For any path that includes the worse hypothesis, we can substitute the better one and get a better score. Thus, we can drop the worse hypothesis.

Identical English output is not required for hypothesis recombination, as the example in Figure 6.4 illustrates. Here we consider two paths that start with different translations of the first German word er, namely it and he, but then continue with the same translation option, skipping one German word and translating ja nicht as does not.

Here we have two hypotheses that look similar: they are identical in their German coverage, but differ slightly in the English output. However, recall the argument for why we can recombine hypotheses: if any path starting from the worse hypothesis can also be used to extend the better hypothesis with the same costs, we can safely drop the worse hypothesis, since it can never be part of the best path.

It does not matter in our two examples that the two hypothesis paths differ in their first English word. All subsequent phrase translation costs do not depend on already generated English words. Only the language model ...
Pruning
Pruning = cut away things that are most likely not useful at all
Pruning in SMT decoding• expand only promising hypotheses• discard bad hypotheses early to avoid wasting time
Pruning greatly improves efficiency• but compromises search optimality
Stack Decoding

Illustrations by Philipp Koehn

[Figure: hypothesis stacks for "no word translated", "one word translated", "two words translated", "three words translated"; partial hypotheses such as "it", "he", "are", "goes", "yes", "does not" sit on the stack matching their coverage]

• Hypothesis expansion in a stack decoder
  – a translation option is applied to a hypothesis
  – the new hypothesis is dropped into a stack further down

Stack decoding algorithm

 1: AddToStack(s_0, h_0)
 2: for i = 0 ... N − 1 do
 3:   for all h ∈ s_i do
 4:     for all t ∈ T do
 5:       if Applicable(h, t) then
 6:         h′ ← Expand(h, t)
 7:         j ← WordsCovered(h) + WordsCovered(t)
 8:         AddToStack(s_j, h′)
 9:       end if
10:     end for
11:   end for
12: end for
13: return best hypothesis on stack s_N

AddToStack(s, h)

 1: for all h′ ∈ s do
 2:   if Recombinable(h, h′) then
 3:     add the higher-scoring of h, h′ to stack s, discard the other
 4:     return
 5:   end if
 6: end for
 7: add h to stack s
 8: if stack too large then
 9:   prune stack
10: end if

Stacks are indexed by the number of input words covered by the given hypothesis, and each stack has a limited size — this is important for efficiency.
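The pseudocode above can be sketched in a few lines of Python. This is a deliberately simplified toy (monotone decoding only, no language model, no recombination; the phrase options are invented), but it shows the stack structure and histogram pruning:

```python
def stack_decode(src_len, options, stack_size=10):
    """Toy monotone stack decoder (a sketch, not Moses).
    options: {(start, end): [(target_phrase, log_prob), ...]}
    A hypothesis is (score, frontier, output_tuple)."""
    stacks = [[] for _ in range(src_len + 1)]
    stacks[0] = [(0.0, 0, ())]                      # empty hypothesis
    for i in range(src_len):
        # histogram pruning: keep only the top stack_size hypotheses
        stacks[i] = sorted(stacks[i], reverse=True)[:stack_size]
        for score, frontier, out in stacks[i]:
            for (start, end), phrases in options.items():
                if start != frontier:               # monotone: extend at frontier
                    continue
                for tgt, logp in phrases:
                    # drop the new hypothesis into a stack further down
                    stacks[end].append((score + logp, end, out + (tgt,)))
    return max(stacks[src_len], default=None)       # best complete hypothesis

# Hypothetical options for a 3-word input such as "er geht nicht":
options = {
    (0, 1): [("he", -0.5), ("it", -0.9)],
    (1, 2): [("goes", -0.7)],
    (1, 3): [("does not go", -1.0)],
    (2, 3): [("not", -0.6)],
}
best = stack_decode(3, options)
```

A real decoder additionally checks a coverage bitmap (so phrases can be taken out of order, within the distortion limit) and recombines hypotheses inside AddToStack before pruning.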
Pruning

Histogram Pruning
• keep no more than n hypotheses per stack
• parameter: stack size n

Threshold Pruning
• discard hypotheses with low scores compared to the score of the best hypothesis h* on the same stack
• Score(h) < a · Score(h*)
• parameter: threshold factor a

Beam Search Complexity

Setup: for each of the N words in the input sentence,
• expand n hypotheses
• by considering T translation options each

Complexity: O(n · N · T)

The number of translation options is linear in the sentence length:
• complexity: O(n · N²)
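Both pruning schemes can be sketched in one small function (an illustration, not the Moses implementation; hypothesis tuples and parameter values are invented). Note that with negative log scores, the threshold test Score(h) < a · Score(h*) needs a > 1:

```python
def prune_stack(stack, n=100, alpha=1.5):
    """Histogram + threshold pruning on a stack of (score, ...) tuples,
    where score is a log-probability (negative)."""
    if not stack:
        return stack
    best = max(h[0] for h in stack)
    # threshold pruning: drop hypotheses scoring below alpha * best
    kept = [h for h in stack if h[0] >= alpha * best]
    # histogram pruning: keep at most n hypotheses, best first
    return sorted(kept, reverse=True)[:n]

pruned = prune_stack([(-5.0, "a"), (-7.0, "b"), (-9.0, "c")], n=10, alpha=1.5)
# keeps only the hypotheses scoring at least 1.5 * (-5.0) = -7.5
```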
Distortion Limits

Limiting reordering reduces the search space dramatically
• most partial hypotheses cover the same input
• the number of translation options at each step is bounded by a constant

Is it OK to do that?
• for closely related languages, most reordering is local
• we do not have good long-range reordering models anyway
• could do pre-ordering if necessary

Bakom huset hittade polisen en stor mängd narkotika .
Behind the house police (limited distortion window)

Decoder complexity: O(n · N)
Incremental Scoring and Cherry Picking

With limited reordering, the number of hypotheses expanded by a beam-search decoder is linear in the stack size and the input size: O(S · N)

Bakom huset hittade polisen en stor mängd narkotika .
Behind the house police found
Behind the house police a big

The path that looks cheapest may incur a much higher cost later, and pruning may discard better options before this is recognised. To make scores more comparable, we should take unavoidable future costs into account: compare hypotheses based on current score + future score.

Problem: delaying expensive decisions loses good hypotheses to early pruning.
Incremental Scoring and Cherry Picking

Cherry-picking problems
• easy decisions first
• promising hypotheses may become very expensive later
• pruning discards hypotheses that are cheaper in the end

Solution
• estimate an approximate future cost
• add a future cost estimate to each hypothesis
Future Cost Estimation

Calculating the future cost exactly would amount to full decoding!

Cheaper approximations can be computed by making additional independence assumptions:
• assume independence between models
• ignore LM history across phrase boundaries

Cost Estimates from Translation Options (illustration by Philipp Koehn)

[Figure: "the tourism initiative addresses this for the first time" — the cost of the cheapest translation option for each input span (log-probabilities), e.g. −1.0, −2.0, −1.5, −2.4, −1.0, −1.0, −1.9, −1.6, −1.4 for the single words, and combined estimates such as −4.0, −2.5, −2.3, −2.2 for longer spans]
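The span estimates above are typically combined by dynamic programming over the input: the future cost of a span is the best of (a) covering it with a single translation option and (b) splitting it into two cheaper sub-spans. A sketch, with invented span costs:

```python
def future_cost_table(src_len, options):
    """options: {(start, end): best log-prob of any translation of that span}.
    cost[i][j] = cheapest estimated way to cover source span [i, j)."""
    NEG = float("-inf")
    cost = [[NEG] * (src_len + 1) for _ in range(src_len + 1)]
    for length in range(1, src_len + 1):
        for i in range(src_len - length + 1):
            j = i + length
            best = options.get((i, j), NEG)     # cover the span in one phrase
            for k in range(i + 1, j):           # or split it at position k
                best = max(best, cost[i][k] + cost[k][j])
            cost[i][j] = best
    return cost

# Hypothetical cheapest options for a 3-word input:
opts = {(0, 1): -1.0, (1, 2): -2.0, (2, 3): -1.5, (1, 3): -2.4}
table = future_cost_table(3, opts)
# covering words 1..3 with the two-word option (-2.4) beats
# covering them word by word (-2.0 + -1.5 = -3.5)
```

During decoding, a hypothesis is then compared on current score plus the future-cost estimates of its uncovered spans.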
DP Beam Search Decoding: Evaluation
DP beam search is by far the most popular search algorithmfor phrase-based SMT.It combines high speed with reasonable accuracy byexploiting the constraints of the standard models.It works well with very local models.
Sentence-internal long-range dependenciesincrease search errors by inhibiting recombination.No cross-sentence dependencies on the target side.
Current state of the art: Almost perfect local fluency, butserious problems with long-range reordering anddiscourse-level phenomena.
Feature Weight Tuning

f(s, t) = Σ_i λ_i h_i(s, t)

The SMT model is a weighted sum of feature scores:
• translation model
• language model
• distortion model

The feature models are trained on corpus data. But where do we get the weights λ_i from?
Summary on Decoding
Dynamic Programming Beam Search
• most popular search algorithm (but not the only one)
• high speed and reasonable accuracy
• fits well with the feature functions of standard PB-SMT

Limitations
• not efficient with non-local models (modeling long-distance dependencies)
• not possible to include cross-sentence dependencies on the target language side
Factored translation

Problems with PBSMT

No use of morphology:
• treats inflectional variants ("look", "looks", "looked") as completely different words!
• in learning translation models, knowing how to translate "look" doesn't help to translate "looks"

Works fine for English (and reasonable amounts of data)

Problems:
• morphologically rich languages
• sparse data sets
• flexible word order
Represent words by factors

Factored translation models (Koehn, U Edinburgh, ESSLLI Summer School, Day 5)

• Factored representation of words
  – input: word, lemma, part-of-speech, morphology, word class, ...
  – output: word, lemma, part-of-speech, morphology, word class, ...
• Goals
  – generalization, e.g. by translating lemmas, not surface forms
  – richer models, e.g. using syntax for reordering or language modeling

Related work
• Back-off to representations with richer statistics (lemma, etc.) [Nießen and Ney 2001, Yang and Kirchhoff 2006, Talbot and Osborne 2006]
• Use of additional annotation in pre-processing (POS, syntax trees, etc.) [Collins et al. 2005, Crego et al. 2006]
• Use of additional annotation in re-ranking (morphological features, POS, syntax trees, etc.) [Och et al. 2004, Koehn and Knight 2005]
→ we pursue an integrated approach
• Use of syntactic tree structure [Wu 1997, Alshawi et al. 1998, Yamada and Knight 2001, Melamed 2004, Menezes and Quirk 2005, Chiang 2005, Galley et al. 2006]
→ may be combined with our approach
Factored models (2)

Morphology
• is productive
• is well understood
• has generalizable patterns

Factored models
• learn translations of base forms
• learn to map morphology
• learn to generate the target surface form
Factored models (3)

Represent words by factors? Why?
• combine scores for translating various factors
• back off to other factors (lemma)
• use various factors for reordering
• better word alignment (?)

Better generalization
• can translate words that we haven't seen in training
• better statistics for translation options

Richer model (more (linguistic) information)
• PoS, syntactic function, semantic role, ...
Factored model example (1)

Decomposing translation:

[Figure: the input word is analysed into lemma and part-of-speech (analysis step); lemma and POS are translated separately (translation steps); the output surface word form is generated from the translated factors (generation step)]

• translate lemma and POS separately
• generate surface word forms from the translated factors
Factored model example (2)

Use the benefits of general phrase-based SMT!
• factored models as alternative paths (or backoff)

Basic phrase-based SMT is very powerful — why generalize if we know the specific translation?

[Figure: translate on the word factor directly, or via lemma + part-of-speech + morphology]

• prefer the surface model for known words
• use the morphgen model as back-off
Factored model results

Factored models do not always lead to improvements, and complicated models are slow compared to standard PBSMT.

Some results (German/English, Koehn):

System            In-domain   Out-of-domain
Baseline          18.19       15.01
With POS LM       19.05       15.03
Morphgen model    14.38       11.65
Both model paths  19.47       15.23

• factors on the token level
• flexible SMT framework
• many possible factors & translation/generation steps
• not much success yet ...
Simple factored models

Simpler models are often more useful! Example (3): translate word → word, with an additional POS factor — useful with a POS LM.

• often useful with POS/morphology LMs
• not much slower than standard models
• tend to give some improvement in agreement

Example: improved word order of split compounds. Number of compound modifiers without a head:
• system without a POS model: 136
• system with a POS model: 6
Factored model in Moses

Full support in Moses:
• http://www.statmt.org/moses/?n=Moses.FactoredTutorial

Data format (example):
• 4 source-language factors (word|lemma|pos|morph)
• 3 target-language factors (word|lemma|pos)

==> factored-corpus/proj-syndicate.de <==
korruption|korruption|nn|nn.fem.cas.sg floriert|florieren|vvfin|vvfin .|.|per|per

==> factored-corpus/proj-syndicate.en <==
corruption|corruption|nn flourishes|flourish|nns .|.|.
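The format is simple enough to parse by hand: factors of a token are joined by "|", tokens by spaces. A small helper sketch (the function name and factor names are illustrative, not part of Moses):

```python
def parse_factored(line, factor_names):
    """Split a Moses factored-format line into one dict per token,
    mapping factor name -> value."""
    tokens = []
    for tok in line.split():
        values = tok.split("|")
        tokens.append(dict(zip(factor_names, values)))
    return tokens

line = "korruption|korruption|nn|nn.fem.cas.sg floriert|florieren|vvfin|vvfin"
toks = parse_factored(line, ["word", "lemma", "pos", "morph"])
# toks[1]["lemma"] == "florieren"
```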
SMT Approaches

• Word-based SMT
• Phrase-based SMT
• Factored SMT

Still not happy? What's the problem?
• very simple reordering models
• only consecutive phrases (word sequences)
• long-distance dependencies / agreement
• problems with ungrammatical output

Tree-based SMT
• addresses these issues to some extent
• but computationally more complex, with an even larger search space

A family of approaches (with confusing terminology):
• hierarchical phrase-based models (Hiero)
• tree-to-string, string-to-tree, tree-to-tree
• syntax-directed MT
• syntax-augmented MT
• syntactified phrase-based MT
• string-to-dependency
• dependency treelet-based
Document-level MT

Discourse-Level Phenomena

Translate the following to German:
• The girl took my key from the door lock.
• But it wasn't her key!
• I said: "Put it back!"

Das Mädchen nahm meinen Schlüssel aus dem Schloss.
Aber es war nicht ihr Schlüssel.
Ich sagte: "Stell es zurück!" / "Leg ihn zurück!" / "Steck ihn wieder rein!"

The pronoun in the last sentence (es / ihn / sein (?)) must agree with its antecedent (Schlüssel), which appears in an earlier sentence.
Connectedness of Natural Language
Textual cohesion
• discourse markers
• referential devices (e.g. pronominal anaphora)
• ellipses (word omissions) and substitutions
• lexical cohesion (word repetition, collocation)

Textual coherence
• semantically meaningful relations
• logical tense structure
• presuppositions and implications connected to general world knowledge
Discourse and Machine Translation
Long-distance relations are lost in local MT models
• sentence-by-sentence translation
• limited context window

Discourse-level devices do not map easily between languages
• explicit vs. implicit discourse markers
• grammatical agreement in anaphoric relations

Terminological consistency
• domain-specific requirements
Pronominal Anaphora
Translating pronouns requires target-language dependencies
• The funeral of the Queen Mother will take place on Friday. It will be broadcast live.
• Les funérailles de la reine-mère auront lieu vendredi. Elles seront retransmises en direct.
Alternative translation:
• L’enterrement de la reine-mère aura lieu vendredi. Il sera retransmis en direct.
Beam Search Decoding
Advantages:
• very good search results in a huge search space
• manageable complexity
• best efficiency/accuracy trade-off with current models

Disadvantages:
• Markov assumption (independence outside of a limited history)
• sentence-internal long-distance dependencies dramatically increase the search space and cause more search errors
• difficult to integrate cross-sentence dependencies
Greedy Hill Climbing

Initialize, then from the given state generate a successor state
• apply a randomly chosen operation at a random location

Compute the score of the new state
• may look at any place in the document
• efficient score updates are necessary

Hill climbing
• accept if the new state has a higher model score
• continue until satisfied (or tired of waiting)

Implemented in the Docent decoder
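The control loop is generic and fits in a few lines. A sketch of the idea (not the Docent code; the toy score function and operations below are invented for illustration):

```python
import random

def hill_climb(state, operations, score, max_steps=10_000, max_rejects=100):
    """First-choice hill climbing.
    operations: list of functions state -> candidate successor state."""
    current, current_score = state, score(state)
    rejects = 0
    for _ in range(max_steps):
        candidate = random.choice(operations)(current)
        cand_score = score(candidate)
        if cand_score > current_score:   # accept only improvements
            current, current_score = candidate, cand_score
            rejects = 0
        else:
            rejects += 1
            if rejects >= max_rejects:   # give up after many rejections in a row
                break
    return current

# Toy usage: climb towards the maximum of -(x - 3)^2 by nudging x.
random.seed(0)
ops = [lambda x: x + 0.1, lambda x: x - 0.1]
result = hill_climb(0.0, ops, lambda x: -(x - 3) ** 2)
```

In Docent, the state is a complete document translation, the operations change phrase translations, segmentations, and phrase order, and the score is the document-level model score.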
Learning Curves

First-choice hill climbing: for a given state, generate a successor state by applying a randomly chosen operation at a random location in the document. Compute the score of the new state. Accept the new state if it is better, else keep the current state. Iterate until
• a maximum number of steps is reached, or
• there is a large number of rejections in a row.

[Figure: learning curves on English-French newswire (WMT task), standard models, with and without beam-search initialisation: normalised model score and BLEU plotted against the number of search steps (10² to 10⁸)]

Part II: Pronominal Anaphora
Selected Discourse-Level Features
Lexical cohesion models
• motivation: one sense per discourse / translation
• document language model using distributional semantics

Lexical consistency
• penalise inconsistent translation options

Stylistic features, target group adaptation
• improved readability
• simplified translation
• format / formatting constraints
Document-Level Decoding in Docent

[Figure: the DOCENT loop — a document translation is initialised by MOSES or randomly; a change operation is applied and the score updated; the change is accepted or rejected; the loop ends at a step limit (10⁸) or rejection limit (10⁵), yielding the final document translation]

Local search with stochastic search operations
• start with a complete (suboptimal) solution
• apply small changes anywhere to improve it
• hill-climbing or annealing
Document-Level Decoding
Initial state
• randomly selected segmentation and translation
• or initialised by beam search with local features only

Search operations
• simple operations that open up the entire search space
• next operation selected at random

Search and termination
• accept useful changes
• continue until the step limit is reached
• or the rejection limit is reached

Search is non-deterministic
Example: Initial Step
• that prevention this disease
• unfortunately , there are is not a miracle cure contribute to preventing get cancer but
• in spite to the made progress in ’ scientific research there remains the a healthy ways of life the best solution , the if of the risks decrease , in with him disease this .
Example: Step 8,192
• that prevention of the disease
• unfortunately , there are no miracle cure for preventing cancer .
• in spite of the progress in research remains the adoption of a healthy lifestyle is the best way , about the risk to reduce , in him on developing them .
Example: Step 134,217,728
• prevention of the disease
• unfortunately , there is no miracle cure for preventing cancer .
• despite the progress in research , the adoption of a healthy lifestyle , the best way to minimise the risk of him on ill .
Search Operations

Local search operations for phrase-based SMT, illustrated on the source sentence er geht ja nicht nach hause:

Change phrase translation:
• change the translation of one randomly selected phrase
(it is yes not after house → he is yes not after house)

Resegment:
• find a new segmentation into phrases for a randomly selected sequence of adjacent source words
(he is yes not after house → he is not after house)

Swap phrases:
• exchange the positions of two target phrases
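The operations can be illustrated on a toy sequence of phrase pairs (a minimal sketch with hypothetical data structures; Docent's real operations also maintain segmentations, alignments and feature scores):

```python
# A translation state as a list of (source_phrase, target_phrase) pairs,
# e.g. for the example sentence "er geht ja nicht nach hause":
state = [("er", "it"), ("geht", "is"), ("ja", "yes"),
         ("nicht", "not"), ("nach", "after"), ("hause", "house")]

def change_translation(state, i, new_target):
    """Change the translation of one phrase."""
    new = list(state)
    new[i] = (new[i][0], new_target)
    return new

def resegment(state, i, j, pairs):
    """Replace phrases i..j with a new segmentation of the same source words."""
    return state[:i] + pairs + state[j + 1:]

def swap_phrases(state, i, j):
    """Exchange the positions of two target phrases."""
    new = list(state)
    new[i], new[j] = (new[i][0], new[j][1]), (new[j][0], new[i][1])
    return new

# change the translation "it" -> "he"
state = change_translation(state, 0, "he")
# merge "ja nicht" into one phrase translated as "not"
state = resegment(state, 2, 3, [("ja nicht", "not")])
print(" ".join(t for _, t in state))   # -> he is not after house
```

The decoder picks one of these operations at random in each step; any phrase-based translation of the document is reachable through some sequence of such changes.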
Summary
Document-Level Decoding
• local search with stochastic change operations
• flexible framework that allows long-distance dependencies
• possible to initialize with beam search
Discourse-Level SMT
• respect the connectedness of sentences
• treat phenomena that differ between languages
• difficult to achieve visible (positive) effects
Coming Up
This week
• Choose your preferred project topics (today!)
• Assignment on PBSMT and Moses
• Second lab 1 session
• Lecture on Thursday cancelled
• Start discussing/working on projects
Next week
• Assignment on document-decoding with Docent
• NMT lecture
Later
• More NMT
• Guest lecture
• Project work and student seminars
Course organization
I will be on parental leave from October 2nd
Who to ask questions:
• General formal questions about course organization, grading etc → Mats
• Labs and assignments → Zhengxian
• NMT part of course → Gongbo
• Project-related questions → Gongbo, Zhengxian