Decoding in SMT and more

Sara Stymne
September 24, 2018
Partly based on slides by Christian Hardmeier and Jörg Tiedemann
Road Map

Decoding Phrase-Based SMT Models
• iterative hypothesis expansion
• pruning
• beam search using stack decoding

Factored translation
Document-level translation
Translating with SMT Models
SMT Model
• statistical model of the translation process
• various components
• millions and billions of parameters and options

SMT Decoder
• implements the search for a good translation candidate
• tries to find the translation hypothesis with the highest model score (but may return sub-optimal solutions)
• important features: accuracy and efficiency
Decoding Phrase-Based SMT

Search problem:
• Find the best translation among all possible translation candidates

The Log-Linear Model: Decoding and Tuning in Statistical Machine Translation
Christian Hardmeier, 2015-05-20
Decoding

The decoder is the part of the SMT system that creates the translations.
Given a set of models, how can we translate efficiently and accurately?

Find the best translation among all possible translations:

t* = arg max_t f(s, t) = arg max_t Σ_i λ_i h_i(s, t)

f(s, t)    scoring function
h_i(s, t)  feature functions
λ_i        feature weights
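As a toy illustration of this formula (the candidates, feature values, and weights below are invented for the example), the log-linear score is just a weighted sum of feature values, and decoding picks the candidate that maximises it:

```python
from math import log

def loglinear_score(weights, features):
    # f(s, t) = sum_i lambda_i * h_i(s, t)
    return sum(w * h for w, h in zip(weights, features))

# Hypothetical candidates with two feature scores each
# (log-probabilities from, say, a translation model and a language model).
candidates = {
    "behind the house found police": [log(0.4), log(0.01)],
    "behind the house police found": [log(0.3), log(0.05)],
}
weights = [1.0, 1.5]  # lambda_i; in practice set by weight tuning

best = max(candidates, key=lambda t: loglinear_score(weights, candidates[t]))
```

Note that the weights decide the trade-off: here the language model gets 1.5 times the influence of the translation model, which is enough to prefer the more fluent word order.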
Model Error vs. Search Error

Model Error
• The solution with the highest score according to our model is not an acceptable translation
• Model deficiency

Search Error
• The decoder fails to find the solution with the highest model score
• Shortcomings of the search algorithm
Phrase-based SMT: Generative Model

Bakom huset hittade polisen en stor mängd narkotika .
Behind the house found police a large quantity of narcotics .
Behind the house police found a large quantity of narcotics .

1. Phrase segmentation
2. Phrase translation
3. Output ordering

[Figure: sliding windows over the output — "Behind the house", "the house police", "house police found", "police found a", "found a large" — the context window of the n-gram language model]
Translation Options

Illustrations by Philipp Koehn

er geht ja nicht nach hause

[Figure: translation-option table — for each source span, candidate target phrases such as "he / it / , it / , he", "is / are / goes / go", "yes / , of course / is", "not / do not / does not / is not", "after / to / according to / in", "house / home / chamber / at home", plus multi-word options like "it is / he will be", "it goes / he goes", "is after all / does", "not to", "are not / is not a", "home / under house / return home"]

• Many translation options to choose from
  – in the Europarl phrase table: 2727 matching phrase pairs for this sentence
  – by pruning to the top 20 per phrase, 202 translation options remain

Decoding by Hypothesis Expansion

er geht ja nicht nach hause

[Figure: hypothesis-expansion lattice — partial hypotheses such as "he", "it", "are", "goes", "does not", "yes", "go", "to", "home" are expanded step by step; hypotheses are also created from previously created partial hypotheses]

Get (all) translation options for all possible segmentations of the input sentence to be translated!
Using the available translation options, create translation hypotheses from left to right:
• Is it always possible to translate any sentence in this way?
• What would cause the process to break down, so the decoder can't find a translation that covers the whole input sentence?
• How could you make sure that this never happens?
Decoding Complexity
Naively, in a sentence of N words with T translation options for each phrase, we can have
• O(2^N) phrase segmentations,
• O(T^N) sets of phrase translations,
• O(N!) word reordering permutations
Exploiting Model Locality

What we need to score a new hypothesis is
• the score of the previous hypothesis
• the translation model score
• the new language model score

Example:
Bakom huset hittade polisen en stor mängd narkotika .
Partial hypothesis: Behind the house police found a big
Hypothesis Recombination

Context-Independent Translation Model
• looks only at the current phrase

N-gram Language Model
• looks only at a window of N words

Decoder choices are independent of everything beyond this window!
• no need to reconsider alternative hypotheses once they have moved out of the n-gram history
Hypothesis Recombination

Example:
• three hypotheses with the same coverage
• trigram language model

After the house police     Score = −12.5
Behind the house police    Score = −11.2
, the house police         Score = −22.0

All three hypotheses end in the same two words, so with a trigram language model every future score is identical for all of them. We already know the winner! The competing hypotheses can be discarded, because they will never beat the winning one later on.
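The idea can be sketched as a recombination key (a simplification, not the Moses implementation): two partial hypotheses may be recombined when they agree on everything future scores can depend on — source coverage, the last n−1 target words, and the end position of the last source phrase (for the distortion model).

```python
def recombination_key(coverage, target_words, last_src_end, n=3):
    """coverage: frozenset of covered source positions;
    target_words: tuple of output words produced so far;
    n: order of the language model."""
    # Future LM scores only see the last n-1 words of the output.
    return (coverage, target_words[-(n - 1):], last_src_end)

h1 = (frozenset({0, 1, 2, 3}), ("after", "the", "house", "police"), 3)
h2 = (frozenset({0, 1, 2, 3}), ("behind", "the", "house", "police"), 3)

# Same key: only the higher-scoring of the two needs to be kept.
same = recombination_key(*h1) == recombination_key(*h2)
```

The key makes explicit why identical full outputs are not required: the words that differ ("after" vs. "behind") have already scrolled out of the trigram history.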
Hypothesis Recombination

In the search graph: combine branches

It is a form of dynamic programming
• substantial reduction of the search space
• search is still optimal
• but decoding is still exponential

From Koehn, Statistical Machine Translation, Chapter 6:

[Figure 6.3: Recombination example: two hypothesis paths lead to two matching hypotheses: they have the same number of foreign words translated and the same English words in the output, but different scores. In the heuristic search they can be recombined; the worse hypothesis is dropped.]

[Figure 6.4: More complex recombination example: in heuristic search with a trigram language model, two hypotheses can also be recombined here. Both hypotheses are the same with respect to subsequent costs: the language model still considers the last two words produced (does not), but not the third-last word. All other later costs are independent of the words produced so far.]

The worse hypothesis can never be part of the best overall path. For any path that includes the worse hypothesis, we can substitute the better one and get a better score. Thus, we can drop the worse hypothesis.

Identical English output is not required for hypothesis recombination, as the example in Figure 6.4 illustrates. Here we consider two paths that start with different translations of the first German word er, namely it and he, but then continue with the same translation option, skipping one German word and translating ja nicht as does not.

Here we have two hypotheses that look similar: they are identical in their German coverage, but differ slightly in the English output. However, recall the argument for why we can recombine hypotheses: if any path starting from the worse hypothesis can also be used to extend the better hypothesis with the same costs, we can safely drop the worse hypothesis, since it can never be part of the best path.

It does not matter in our two examples that the two hypothesis paths differ in their first English word. All subsequent phrase translation costs do not depend on already generated English words. Only the language model ...
Pruning
Pruning = cut away things that are most likely not useful at all
Pruning in SMT decoding• expand only promising hypotheses• discard bad hypotheses early to avoid wasting time
Pruning greatly improves efficiency• but compromises search optimality
Stack Decoding

Illustrations by Philipp Koehn

[Figure: hypothesis stacks for "no word translated", "one word translated", "two words translated", "three words translated"; partial hypotheses such as "it", "he", "are", "goes", "yes", "does not" sit on the stack matching their coverage]

• Hypothesis expansion in a stack decoder
  – a translation option is applied to a hypothesis
  – the new hypothesis is dropped into a stack further down

Stack decoding algorithm

 1: AddToStack(s_0, h_0)
 2: for i = 0 ... N − 1 do
 3:   for all h ∈ s_i do
 4:     for all t ∈ T do
 5:       if Applicable(h, t) then
 6:         h′ ← Expand(h, t)
 7:         j ← WordsCovered(h) + WordsCovered(t)
 8:         AddToStack(s_j, h′)
 9:       end if
10:     end for
11:   end for
12: end for
13: return best hypothesis on stack s_N

AddToStack(s, h)

 1: for all h′ ∈ s do
 2:   if Recombinable(h, h′) then
 3:     add the higher-scoring of h, h′ to stack s, discard the other
 4:     return
 5:   end if
 6: end for
 7: add h to stack s
 8: if stack too large then
 9:   prune stack
10: end if

Stacks are indexed by the number of input words covered by the given hypothesis, and each stack has a limited size — this is important for efficiency.
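The pseudocode above can be sketched in a few lines of Python. This is a deliberately simplified toy (monotone decoding only, no language model, no recombination; the phrase options are invented), but it shows the stack structure and histogram pruning:

```python
def stack_decode(src_len, options, stack_size=10):
    """Toy monotone stack decoder (a sketch, not Moses).
    options: {(start, end): [(target_phrase, log_prob), ...]}
    A hypothesis is (score, frontier, output_tuple)."""
    stacks = [[] for _ in range(src_len + 1)]
    stacks[0] = [(0.0, 0, ())]                      # empty hypothesis
    for i in range(src_len):
        # histogram pruning: keep only the top stack_size hypotheses
        stacks[i] = sorted(stacks[i], reverse=True)[:stack_size]
        for score, frontier, out in stacks[i]:
            for (start, end), phrases in options.items():
                if start != frontier:               # monotone: extend at frontier
                    continue
                for tgt, logp in phrases:
                    # drop the new hypothesis into a stack further down
                    stacks[end].append((score + logp, end, out + (tgt,)))
    return max(stacks[src_len], default=None)       # best complete hypothesis

# Hypothetical options for a 3-word input such as "er geht nicht":
options = {
    (0, 1): [("he", -0.5), ("it", -0.9)],
    (1, 2): [("goes", -0.7)],
    (1, 3): [("does not go", -1.0)],
    (2, 3): [("not", -0.6)],
}
best = stack_decode(3, options)
```

A real decoder additionally checks a coverage bitmap (so phrases can be taken out of order, within the distortion limit) and recombines hypotheses inside AddToStack before pruning.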
Pruning

Histogram Pruning
• keep no more than n hypotheses per stack
• parameter: stack size n

Threshold Pruning
• discard hypotheses with low scores compared to the score of the best hypothesis h* on the same stack
• Score(h) < a · Score(h*)
• parameter: threshold factor a

Beam Search Complexity

Setup: for each of the N words in the input sentence,
• expand n hypotheses
• by considering T translation options each

Complexity: O(n · N · T)

The number of translation options is linear in the sentence length:
• complexity: O(n · N²)
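Both pruning schemes can be sketched in one small function (an illustration, not the Moses implementation; hypothesis tuples and parameter values are invented). Note that with negative log scores, the threshold test Score(h) < a · Score(h*) needs a > 1:

```python
def prune_stack(stack, n=100, alpha=1.5):
    """Histogram + threshold pruning on a stack of (score, ...) tuples,
    where score is a log-probability (negative)."""
    if not stack:
        return stack
    best = max(h[0] for h in stack)
    # threshold pruning: drop hypotheses scoring below alpha * best
    kept = [h for h in stack if h[0] >= alpha * best]
    # histogram pruning: keep at most n hypotheses, best first
    return sorted(kept, reverse=True)[:n]

pruned = prune_stack([(-5.0, "a"), (-7.0, "b"), (-9.0, "c")], n=10, alpha=1.5)
# keeps only the hypotheses scoring at least 1.5 * (-5.0) = -7.5
```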
Distortion Limits

Limiting reordering reduces the search space dramatically
• most partial hypotheses cover the same input
• the number of translation options at each step is bounded by a constant

Is it OK to do that?
• for closely related languages, most reordering is local
• we do not have good long-range reordering models anyway
• could do pre-ordering if necessary

Bakom huset hittade polisen en stor mängd narkotika .
Behind the house police (limited distortion window)

Decoder complexity: O(n · N)
Incremental Scoring and Cherry Picking

With limited reordering, the number of hypotheses expanded by a beam-search decoder is linear in the stack size and the input size: O(S · N)

Bakom huset hittade polisen en stor mängd narkotika .
Behind the house police found
Behind the house police a big

The path that looks cheapest may incur a much higher cost later, and pruning may discard better options before this is recognised. To make scores more comparable, we should take unavoidable future costs into account: compare hypotheses based on current score + future score.

Problem: delaying expensive decisions loses good hypotheses to early pruning.
Incremental Scoring and Cherry Picking

Cherry-picking problems
• easy decisions first
• promising hypotheses may become very expensive later
• pruning discards hypotheses that are cheaper in the end

Solution
• estimate an approximate future cost
• add a future cost estimate to each hypothesis
Future Cost Estimation

Calculating the future cost exactly would amount to full decoding!

Cheaper approximations can be computed by making additional independence assumptions:
• assume independence between models
• ignore LM history across phrase boundaries

Cost Estimates from Translation Options (illustration by Philipp Koehn)

[Figure: "the tourism initiative addresses this for the first time" — the cost of the cheapest translation option for each input span (log-probabilities), e.g. −1.0, −2.0, −1.5, −2.4, −1.0, −1.0, −1.9, −1.6, −1.4 for the single words, and combined estimates such as −4.0, −2.5, −2.3, −2.2 for longer spans]
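The span estimates above are typically combined by dynamic programming over the input: the future cost of a span is the best of (a) covering it with a single translation option and (b) splitting it into two cheaper sub-spans. A sketch, with invented span costs:

```python
def future_cost_table(src_len, options):
    """options: {(start, end): best log-prob of any translation of that span}.
    cost[i][j] = cheapest estimated way to cover source span [i, j)."""
    NEG = float("-inf")
    cost = [[NEG] * (src_len + 1) for _ in range(src_len + 1)]
    for length in range(1, src_len + 1):
        for i in range(src_len - length + 1):
            j = i + length
            best = options.get((i, j), NEG)     # cover the span in one phrase
            for k in range(i + 1, j):           # or split it at position k
                best = max(best, cost[i][k] + cost[k][j])
            cost[i][j] = best
    return cost

# Hypothetical cheapest options for a 3-word input:
opts = {(0, 1): -1.0, (1, 2): -2.0, (2, 3): -1.5, (1, 3): -2.4}
table = future_cost_table(3, opts)
# covering words 1..3 with the two-word option (-2.4) beats
# covering them word by word (-2.0 + -1.5 = -3.5)
```

During decoding, a hypothesis is then compared on current score plus the future-cost estimates of its uncovered spans.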
DP Beam Search Decoding: Evaluation
DP beam search is by far the most popular search algorithmfor phrase-based SMT.It combines high speed with reasonable accuracy byexploiting the constraints of the standard models.It works well with very local models.
Sentence-internal long-range dependenciesincrease search errors by inhibiting recombination.No cross-sentence dependencies on the target side.
Current state of the art: Almost perfect local fluency, butserious problems with long-range reordering anddiscourse-level phenomena.
Feature Weight Tuning

f(s, t) = Σ_i λ_i h_i(s, t)

The SMT model is a weighted sum of feature scores:
• translation model
• language model
• distortion model

The feature models are trained on corpus data. But where do we get the weights λ_i from?
Summary on Decoding
Dynamic Programming Beam Search
• most popular search algorithm (but not the only one)
• high speed and reasonable accuracy
• fits well with the feature functions of standard PB-SMT

Limitations
• not efficient with non-local models (modeling long-distance dependencies)
• not possible to include cross-sentence dependencies on the target language side
Factored translation

Problems with PBSMT

No use of morphology:
• treats inflectional variants ("look", "looks", "looked") as completely different words!
• in learning translation models, knowing how to translate "look" doesn't help to translate "looks"

Works fine for English (and reasonable amounts of data)

Problems:
• morphologically rich languages
• sparse data sets
• flexible word order
Represent words by factors

Factored translation models (Koehn, U Edinburgh, ESSLLI Summer School, Day 5)

• Factored representation of words
  – input: word, lemma, part-of-speech, morphology, word class, ...
  – output: word, lemma, part-of-speech, morphology, word class, ...
• Goals
  – generalization, e.g. by translating lemmas, not surface forms
  – richer models, e.g. using syntax for reordering or language modeling

Related work
• Back-off to representations with richer statistics (lemma, etc.) [Nießen and Ney 2001, Yang and Kirchhoff 2006, Talbot and Osborne 2006]
• Use of additional annotation in pre-processing (POS, syntax trees, etc.) [Collins et al. 2005, Crego et al. 2006]
• Use of additional annotation in re-ranking (morphological features, POS, syntax trees, etc.) [Och et al. 2004, Koehn and Knight 2005]
→ we pursue an integrated approach
• Use of syntactic tree structure [Wu 1997, Alshawi et al. 1998, Yamada and Knight 2001, Melamed 2004, Menezes and Quirk 2005, Chiang 2005, Galley et al. 2006]
→ may be combined with our approach
Factored models (2)

Morphology
• is productive
• is well understood
• has generalizable patterns

Factored models
• learn translations of base forms
• learn to map morphology
• learn to generate the target surface form
Factored models (3)

Represent words by factors? Why?
• combine scores for translating various factors
• back off to other factors (lemma)
• use various factors for reordering
• better word alignment (?)

Better generalization
• can translate words that we haven't seen in training
• better statistics for translation options

Richer model (more (linguistic) information)
• PoS, syntactic function, semantic role, ...
Factored model example (1)

Decomposing translation:

[Figure: the input word is analysed into lemma and part-of-speech (analysis step); lemma and POS are translated separately (translation steps); the output surface word form is generated from the translated factors (generation step)]

• translate lemma and POS separately
• generate surface word forms from the translated factors
Factored model example (2)

Use the benefits of general phrase-based SMT!
• factored models as alternative paths (or backoff)

Basic phrase-based SMT is very powerful — why generalize if we know the specific translation?

[Figure: translate on the word factor directly, or via lemma + part-of-speech + morphology]

• prefer the surface model for known words
• use the morphgen model as back-off
Factored model results

Factored models do not always lead to improvements, and complicated models are slow compared to standard PBSMT.

Some results (German/English, Koehn):

System            In-domain   Out-of-domain
Baseline          18.19       15.01
With POS LM       19.05       15.03
Morphgen model    14.38       11.65
Both model paths  19.47       15.23

• factors on the token level
• flexible SMT framework
• many possible factors & translation/generation steps
• not much success yet ...
Simple factored models

Simpler models are often more useful! Example (3): translate word → word, with an additional POS factor — useful with a POS LM.

• often useful with POS/morphology LMs
• not much slower than standard models
• tend to give some improvement in agreement

Example: improved word order of split compounds. Number of compound modifiers without a head:
• system without a POS model: 136
• system with a POS model: 6
Factored model in Moses

Full support in Moses:
• http://www.statmt.org/moses/?n=Moses.FactoredTutorial

Data format (example):
• 4 source-language factors (word|lemma|pos|morph)
• 3 target-language factors (word|lemma|pos)

==> factored-corpus/proj-syndicate.de <==
korruption|korruption|nn|nn.fem.cas.sg floriert|florieren|vvfin|vvfin .|.|per|per

==> factored-corpus/proj-syndicate.en <==
corruption|corruption|nn flourishes|flourish|nns .|.|.
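The format is simple enough to parse by hand: factors of a token are joined by "|", tokens by spaces. A small helper sketch (the function name and factor names are illustrative, not part of Moses):

```python
def parse_factored(line, factor_names):
    """Split a Moses factored-format line into one dict per token,
    mapping factor name -> value."""
    tokens = []
    for tok in line.split():
        values = tok.split("|")
        tokens.append(dict(zip(factor_names, values)))
    return tokens

line = "korruption|korruption|nn|nn.fem.cas.sg floriert|florieren|vvfin|vvfin"
toks = parse_factored(line, ["word", "lemma", "pos", "morph"])
# toks[1]["lemma"] == "florieren"
```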
SMT Approaches

• Word-based SMT
• Phrase-based SMT
• Factored SMT

Still not happy? What's the problem?
• very simple reordering models
• only consecutive phrases (word sequences)
• long-distance dependencies / agreement
• problems with ungrammatical output

Tree-based SMT
• addresses these issues to some extent
• but computationally more complex, with an even larger search space

A family of approaches (with confusing terminology):
• hierarchical phrase-based models (Hiero)
• tree-to-string, string-to-tree, tree-to-tree
• syntax-directed MT
• syntax-augmented MT
• syntactified phrase-based MT
• string-to-dependency
• dependency treelet-based
Document-level MT

Discourse-Level Phenomena

Translate the following to German:
• The girl took my key from the door lock.
• But it wasn't her key!
• I said: "Put it back!"

Das Mädchen nahm meinen Schlüssel aus dem Schloss.
Aber es war nicht ihr Schlüssel.
Ich sagte: "Stell es zurück!" / "Leg ihn zurück!" / "Steck ihn wieder rein!"

The pronoun in the last sentence (es / ihn / sein (?)) must agree with its antecedent (Schlüssel), which appears in an earlier sentence.
Connectedness of Natural Language
Textual cohesion
• discourse markers
• referential devices (e.g. pronominal anaphora)
• ellipses (word omissions) and substitutions
• lexical cohesion (word repetition, collocation)

Textual coherence
• semantically meaningful relations
• logical tense structure
• presuppositions and implications connected to general world knowledge
Discourse and Machine Translation
Long-distance relations are lost in local MT models
• sentence-by-sentence translation
• limited context window

Discourse-level devices do not map easily between languages
• explicit vs. implicit discourse markers
• grammatical agreement in anaphoric relations

Terminological consistency
• domain-specific requirements
Pronominal Anaphora
Translating pronouns requires target-language dependencies
• The funeral of the Queen Mother will take place on Friday. It will be broadcast live.
• Les funérailles de la reine-mère auront lieu vendredi. Elles seront retransmises en direct.
Alternative translation:
• L’enterrement de la reine-mère aura lieu vendredi. Il sera retransmis en direct.
Beam Search Decoding
Advantages:
• very good search results in a huge search space
• manageable complexity
• best efficiency/accuracy trade-off with current models

Disadvantages:
• Markov assumption (independence outside of a limited history)
• sentence-internal long-distance dependencies dramatically increase the search space and cause more search errors
• difficult to integrate cross-sentence dependencies
Greedy Hill Climbing

Initialize, then from the given state generate a successor state
• apply a randomly chosen operation at a random location

Compute the score of the new state
• may look at any place in the document
• efficient score updates are necessary

Hill climbing
• accept if the new state has a higher model score
• continue until satisfied (or tired of waiting)

Implemented in the Docent decoder
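The control loop is generic and fits in a few lines. A sketch of the idea (not the Docent code; the toy score function and operations below are invented for illustration):

```python
import random

def hill_climb(state, operations, score, max_steps=10_000, max_rejects=100):
    """First-choice hill climbing.
    operations: list of functions state -> candidate successor state."""
    current, current_score = state, score(state)
    rejects = 0
    for _ in range(max_steps):
        candidate = random.choice(operations)(current)
        cand_score = score(candidate)
        if cand_score > current_score:   # accept only improvements
            current, current_score = candidate, cand_score
            rejects = 0
        else:
            rejects += 1
            if rejects >= max_rejects:   # give up after many rejections in a row
                break
    return current

# Toy usage: climb towards the maximum of -(x - 3)^2 by nudging x.
random.seed(0)
ops = [lambda x: x + 0.1, lambda x: x - 0.1]
result = hill_climb(0.0, ops, lambda x: -(x - 3) ** 2)
```

In Docent, the state is a complete document translation, the operations change phrase translations, segmentations, and phrase order, and the score is the document-level model score.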
Learning Curves

First-choice hill climbing: for a given state, generate a successor state by applying a randomly chosen operation at a random location in the document. Compute the score of the new state. Accept the new state if it is better, else keep the current state. Iterate until
• a maximum number of steps is reached, or
• there is a large number of rejections in a row.

[Figure: learning curves on English-French newswire (WMT task), standard models, with and without beam-search initialisation: normalised model score and BLEU plotted against the number of search steps (10² to 10⁸)]

Part II: Pronominal Anaphora
Selected Discourse-Level Features
Lexical cohesion models
• motivation: one sense per discourse / translation
• document language model using distributional semantics

Lexical consistency
• penalise inconsistent translation options

Stylistic features, target group adaptation
• improved readability
• simplified translation
• format / formatting constraints
Document-Level Decoding in Docent

[Figure: the DOCENT loop — a document translation is initialised by MOSES or randomly; a change operation is applied and the score updated; the change is accepted or rejected; the loop ends at a step limit (10⁸) or rejection limit (10⁵), yielding the final document translation]

Local search with stochastic search operations
• start with a complete (suboptimal) solution
• apply small changes anywhere to improve it
• hill-climbing or annealing
Document-Level Decoding
Initial state
• randomly selected segmentation and translation
• or initialised by beam search with local features only

Search operations
• simple operations that open up the entire search space
• next operation selected at random

Search and termination
• accept useful changes
• continue until the step limit is reached
• or the rejection limit is reached

Search is non-deterministic
Example: Initial Step
• that prevention this disease
• unfortunately , there are is not a miracle cure contribute to preventing get cancer but
• in spite to the made progress in ’ scientific research there remains the a healthy ways of life the best solution , the if of the risks decrease , in with him disease this .
Example: Step 8,192
• that prevention of the disease
• unfortunately , there are no miracle cure for preventing cancer .
• in spite of the progress in research remains the adoption of a healthy lifestyle is the best way , about the risk to reduce , in him on developing them .
Example: Step 134,217,728
• prevention of the disease
• unfortunately , there is no miracle cure for preventing cancer .
• despite the progress in research , the adoption of a healthy lifestyle , the best way to minimise the risk of him on ill .
Search Operations

Local search operations for phrase-based SMT, illustrated on the source sentence er geht ja nicht nach hause:

Change phrase translation:
• change the translation of one randomly selected phrase
(it is yes not after house → he is yes not after house)

Resegment:
• find a new segmentation into phrases for a randomly selected sequence of adjacent source words
(he is yes not after house → he is not after house)

Swap phrases:
• exchange the positions of two target phrases
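The operations can be illustrated on a toy sequence of phrase pairs (a minimal sketch with hypothetical data structures; Docent's real operations also maintain segmentations, alignments and feature scores):

```python
# A translation state as a list of (source_phrase, target_phrase) pairs,
# e.g. for the example sentence "er geht ja nicht nach hause":
state = [("er", "it"), ("geht", "is"), ("ja", "yes"),
         ("nicht", "not"), ("nach", "after"), ("hause", "house")]

def change_translation(state, i, new_target):
    """Change the translation of one phrase."""
    new = list(state)
    new[i] = (new[i][0], new_target)
    return new

def resegment(state, i, j, pairs):
    """Replace phrases i..j with a new segmentation of the same source words."""
    return state[:i] + pairs + state[j + 1:]

def swap_phrases(state, i, j):
    """Exchange the positions of two target phrases."""
    new = list(state)
    new[i], new[j] = (new[i][0], new[j][1]), (new[j][0], new[i][1])
    return new

# change the translation "it" -> "he"
state = change_translation(state, 0, "he")
# merge "ja nicht" into one phrase translated as "not"
state = resegment(state, 2, 3, [("ja nicht", "not")])
print(" ".join(t for _, t in state))   # -> he is not after house
```

The decoder picks one of these operations at random in each step; any phrase-based translation of the document is reachable through some sequence of such changes.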
Summary
Document-Level Decoding
• local search with stochastic change operations
• flexible framework that allows long-distance dependencies
• possible to initialize with beam search
Discourse-Level SMT
• respect the connectedness of sentences
• treat phenomena that differ between languages
• difficult to achieve visible (positive) effects
Coming Up
This week
• Choose your preferred project topics (today!)
• Assignment on PBSMT and Moses
• Second lab 1 session
• Lecture on Thursday cancelled
• Start discussing/working on projects
Next week
• Assignment on document-decoding with Docent
• NMT lecture
Later
• More NMT
• Guest lecture
• Project work and student seminars
Course organization
I will be on parental leave from October 2nd
Who to ask questions:
• General formal questions about course organization, grading etc → Mats
• Labs and assignments → Zhengxian
• NMT part of course → Gongbo
• Project-related questions → Gongbo, Zhengxian