7. Trevor Cohn (USFD) Statistical Machine Translation
TRANSCRIPT
[Page 1]
Statistical Machine Translation
Part II: Decoding
Trevor Cohn, U. Sheffield
EXPERT winter school
November 2013
Some figures taken from Koehn 2009
[Page 2]
Recap
You’ve seen several models of translation
word-based models: IBM 1-5
phrase-based models
grammar-based models
Methods for
learning translation rules from bitexts
learning rule weights
learning several other features: language models, reordering, etc.
[Page 3]
Decoding
Central challenge is to predict a good translation
Given text in the input language (f)
Generate translation in the output language (e)
Formally, e* = argmax_e p(f | e) p(e)
where our model scores each candidate translation e using a translation model p(f | e) and a language model p(e)
A decoder is a search algorithm for finding e*
caveat: few modern systems use actual probabilities
[Page 4]
Outline
Decoding phrase-based models
linear model
dynamic programming approach
approximate beam search
Decoding grammar-based models
synchronous grammars
string-to-string decoding
[Page 5]
Decoding objective
Objective
where the model score f incorporates
translation frequencies for phrases
distortion cost based on (re)ordering
language model cost of m-grams in e
...
Problem of ambiguity
may be many different sequences of translation decisions mapping f to e
e.g. could translate word by word, or use larger units
[Page 6]
Decoding for derivations
A derivation is a sequence of translation decisions
can “read off” the input string f and output e
Define model over derivations not translations
aka Viterbi approximation
should sum over all derivations within the maximisation
instead we maximise for tractability
But see Blunsom, Cohn and Osborne (2008)
sum out derivational ambiguity (during training)
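Spelled out, the Viterbi approximation replaces the sum over derivations with a maximisation, where yield(d) denotes the translation string produced by derivation d:

```latex
e^{*} = \operatorname*{arg\,max}_{e} \sum_{d:\, \mathrm{yield}(d)=e} p(d \mid f)
\;\approx\; \mathrm{yield}\!\left(\operatorname*{arg\,max}_{d} p(d \mid f)\right)
```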
[Page 7]
Decoding
Includes a coverage constraint
all input words must be translated exactly once
preserves input information
Cf. ‘fertility’ in IBM word-based models
phrases license one-to-many mappings (insertions) and many-to-one (deletions)
but limited to contiguous spans
These constraints affect the tractability of decoding
[Page 8]
Translation process
Translate this sentence
translate input words and “phrases”
reorder output to form target string
Derivation = sequence of phrases
1. er – he; 2. ja nicht – does not;
3. geht – go; 4. nach hause – home
Figure from Machine Translation Koehn 2009
[Page 9]
Generating process
er geht ja nicht nach hause
1: segment
2: translate
3: order
Consider the translation decisions in a derivation
[Page 10]
er
er geht ja nicht nach hause
1: segment
2: translate
3: order
geht ja nicht nach hause
Generating process
[Page 11]
Generating process
er
er geht ja nicht nach hause
1: segment
2: translate
3: order
geht ja nicht nach hause
he go does not home
[Page 12]
Generating process
er
er geht ja nicht nach hause
1: segment
2: translate
3: order
geht ja nicht nach hause
he go does not home
he does not go home
1: uniform cost (ignore)
2: TM probability
3: distortion cost & LM probability
[Page 13]
Generating process
er
er geht ja nicht nach hause
1: segment
2: translate
3: order
geht ja nicht nach hause
he go does not home
he does not go home
f = 0
+ φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home)
+ ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)
[Page 14]
Linear Model
Assume a linear model
d is a derivation, a sequence of phrase pairs r1 … rK
φ(rk) is the log conditional frequency of a phrase pair
d(·) is the distortion cost between two consecutive phrases
ψ is the log language model probability
each component is scaled by a separate weight
Often mistakenly referred to as log-linear
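As a concrete sketch, the score of the derivation on the earlier slide can be accumulated like this (bigram LM, linear distortion; all probabilities, weights, and helper names here are invented for illustration):

```python
import math

def distortion(jump):
    # linear distortion cost: penalise the size of the jump in input positions
    return -abs(jump)

def score_derivation(derivation, phi, psi, alpha=(1.0, 1.0, 0.5)):
    """derivation: list of (src_phrase, tgt_phrase, jump) in output order.
    phi: (src, tgt) -> log phrase translation frequency
    psi: (word, prev_word) -> log bigram LM probability
    alpha: weights for (translation model, language model, distortion)."""
    a_tm, a_lm, a_d = alpha
    f, prev = 0.0, "<S>"
    for src, tgt, jump in derivation:
        f += a_tm * phi[(src, tgt)] + a_d * distortion(jump)
        for word in tgt.split():      # LM scores each output word in turn
            f += a_lm * psi(word, prev)
            prev = word
    return f + a_lm * psi("</S>", prev)  # close off with the end-of-sentence term
```

With the slide's derivation (er→he, ja nicht→does not, geht→go, nach hause→home, with jumps 0, 1, −3, 2), this accumulates exactly the sum of φ, ψ and d terms shown on the slide, each scaled by its weight.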
[Page 15]
Model components
Typically:
language model and word count
translation model(s)
distortion cost
Values of α learned by discriminative training (not covered today)
[Page 16]
Search problem
Given options
1000s of possible output strings
he does not go home
it is not in house
yes he goes not to home …
Figure from Machine Translation Koehn 2009
[Page 17]
Search Complexity
Search space
Number of segmentations: 32 = 2^5
Number of permutations: 720 = 6!
Number of translation options: 4096 = 4^6
Multiplying gives 94,371,840 derivations
(calculation is naïve, giving loose upper bound)
How can we possibly search this space?
especially for longer input sentences
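The slide's arithmetic can be checked directly; this naive bound simply multiplies the independent choices (as the slide notes, it is loose, since segmentation and permutation interact):

```python
from math import factorial

words = 6
segmentations = 2 ** (words - 1)   # a phrase boundary at each of the 5 gaps: 32
permutations = factorial(words)    # orderings of the translated units: 720
translations = 4 ** words          # assume 4 translation options per word: 4096
total = segmentations * permutations * translations
print(total)  # 94371840
```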
[Page 18]
Search insight
Consider the sorted list of all derivations
…
he does not go after home
he does not go after house
he does not go home
he does not go to home
he does not go to house
he does not goes home
…
Many similar derivations, each with highly similar scores
[Page 19]
Search insight #1
he / does not / go / home
he / does not / go / to home
f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)
f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → to home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(to | go) + ψ(home | to) + d(2) + ψ(</S> | home)
[Page 20]
Search insight #1
Consider all possible ways to finish the translation
[Page 21]
Search insight #1
Score ‘f’ factorises, with shared components across all options.
Can find best completion by maximising f.
[Page 22]
Search insight #2 Several partial translations can be finished the same way
[Page 23]
Search insight #2 Several partial translations can be finished the same way
Only need to consider the maximal-scoring partial translation
[Page 24]
Dynamic Programming Solution
Key ideas behind dynamic programming
factor out repeated computation
efficiently solve the maximisation problem
What are the key components for “sharing”?
don’t have to be exactly identical; need same:
set of untranslated words
right-most output words
last translated input word location
The decoding algorithm aims to exploit this
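A minimal sketch of this sharing signature and the resulting recombination, assuming hypotheses are plain dicts (field names are illustrative):

```python
def signature(hyp, lm_order=3):
    """The state two partial derivations must share to be interchangeable."""
    return (frozenset(hyp["covered"]),               # set of translated input words
            tuple(hyp["output"][-(lm_order - 1):]),  # right-most output words for the LM
            hyp["end"])                              # last translated input position

def recombine(hypotheses):
    # Keep only the highest-scoring hypothesis per signature.
    best = {}
    for h in hypotheses:
        s = signature(h)
        if s not in best or h["score"] > best[s]["score"]:
            best[s] = h
    return list(best.values())
```

Two hypotheses with different histories but the same signature receive identical scores for any completion, so discarding the worse one is lossless.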
[Page 25]
More formally
Considering the decoding maximisation
where d ranges over all derivations covering f
We can split max_d into max_d1 max_d2 …
move some ‘maxes’ inside the expression, over elements not affected by that rule
bracket independent parts of expression
Akin to Viterbi algorithm in HMMs, PCFGs
[Page 26]
Phrase-based Decoding
Start with empty state
Figure from Machine Translation Koehn 2009
[Page 27]
Phrase-based Decoding
Expand by choosing input span and generating translation
Figure from Machine Translation Koehn 2009
[Page 28]
Phrase-based Decoding
Consider all possible options to start the translation
Figure from Machine Translation Koehn 2009
[Page 29]
Phrase-based Decoding
Continue to expand states, visiting uncovered words and generating outputs left to right
Figure from Machine Translation Koehn 2009
[Page 30]
Phrase-based Decoding
Read off translation from best complete derivation by back-tracking
Figure from Machine Translation Koehn 2009
[Page 31]
Dynamic Programming
Recall that shared structure can be exploited
vertices with the same coverage, last output word, and input position are identical for subsequent scoring
Maximise over these paths
aka “recombination” in the MT literature (but really just dynamic programming)
Figure from Machine Translation Koehn 2009
[Page 32]
Complexity
Even with DP search is still intractable
word-based and phrase-based decoding is NP-complete
Knight 99; Zaslavskiy, Dymetman, and Cancedda, 2009
whereas SCFG decoding is polynomial
Complexity arises from
reordering model allowing all permutations (limit: no more than 6 uncovered words)
many translation options (limit: no more than 20 translations per phrase)
coverage constraints, i.e., all words to be translated once
[Page 33]
Pruning
Limit the size of the search graph by eliminating bad paths early
Pharaoh / Moses
divide partial derivations into stacks, based on the number of input words translated
limit the number of derivations in each stack
limit the score difference in each stack
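Both pruning criteria on a single stack might look like this (parameter values are typical defaults, not prescriptive):

```python
import math

def prune_stack(stack, max_size=100, threshold=math.log(1e-3)):
    """Keep at most max_size hypotheses, and drop any whose score falls
    more than |threshold| (in log space) below the best in the stack."""
    if not stack:
        return stack
    stack = sorted(stack, key=lambda h: h["score"], reverse=True)
    best = stack[0]["score"]
    survivors = [h for h in stack if h["score"] >= best + threshold]  # threshold pruning
    return survivors[:max_size]                                       # histogram pruning
```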
[Page 34]
Stack based pruning
Algorithm iteratively “grows” from one stack to the next larger ones, while pruning the entries in each stack.
Figure from Machine Translation Koehn 2009
[Page 35]
Future cost estimate
Higher scores for translating easy parts first
language model prefers common words
Early pruning will eliminate derivations starting with the difficult words
pruning must incorporate an estimate of the cost of translating the remaining words
“future cost estimate” assuming unigram LM and monotone translation
Related to A* search and admissible heuristics
but incurs search error (see Chang & Collins, 2011)
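A sketch of how such a future cost table can be built: a dynamic program over input spans, taking the best of translating a span directly or splitting it. The `span_score` oracle stands in for phrase-table plus unigram-LM estimates and is an assumption of this sketch:

```python
def future_cost_table(n, span_score):
    """span_score(i, j): best direct translation score for input span [i, j),
    or None if no phrase covers it. Returns cost[i][j] for all spans."""
    cost = [[float("-inf")] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):           # shortest spans first
        for i in range(n - length + 1):
            j = i + length
            direct = span_score(i, j)
            if direct is not None:
                cost[i][j] = direct
            for k in range(i + 1, j):        # or combine two cheaper sub-spans
                cost[i][j] = max(cost[i][j], cost[i][k] + cost[k][j])
    return cost
```

During beam search, a hypothesis is compared on its actual score plus the summed future cost of its uncovered spans, so easy-first derivations no longer crowd out the rest.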
[Page 36]
Beam search complexity
Limit the number of translation options per phrase to a constant (often 20)
# translations proportional to input sentence length
Stack pruning
number of entries & score ratio
Reordering limits
finite number of uncovered words (typically 6)
but see Lopez EACL 2009
Resulting complexity
O(stack size × sentence length)
[Page 37]
k-best outputs
Can recover not just the best solution
but also 2nd, 3rd etc best derivations
straightforward extension of beam search
Useful in discriminative training of feature weights, and other applications
[Page 38]
Alternatives for PBMT decoding
FST composition (Kumar & Byrne, 2005)
each process encoded in WFST or WFSA
simply compose automata, minimise and solve
A* search (Och, Ueffing & Ney, 2001)
Sampling (Arun et al, 2009)
Integer linear programming
Germann et al, 2001
Riedel & Clarke, 2009
Lagrangian relaxation
Chang & Collins, 2011
[Page 39]
Outline
Decoding phrase-based models
linear model
dynamic programming approach
approximate beam search
Decoding grammar-based models
tree-to-string decoding
string-to-string decoding
cube pruning
[Page 40]
Grammar-based decoding
Reordering in PBMT is poor and must be limited
otherwise too many bad choices available
and inference is intractable
better if reordering decisions were driven by context
simple form of lexicalised reordering in Moses
Grammar based translation
consider hierarchical phrases with gaps (Chiang 05)
(re)ordering constrained by lexical context
inform process by generating syntax tree (Venugopal & Zollmann, 06; Galley et al, 06)
exploit input syntax (Mi, Huang & Liu, 08)
[Page 41]
Hierarchical phrase-based MT
have diplomatic relations with Australia
yu Aozhou you bangjiao
have diplomatic relations with Australia
yu Aozhou you bangjiao
Standard PBMT
Hierarchical PBMT
Must ‘jump’ back and forth to obtain correct ordering. Guided primarily by language model.
Grammar rule encodes this common reordering: yu X1 you X2 → have X2 with X1
also correlates yu … you and have … with.
Example from Chiang, CL 2007
[Page 42]
SCFG recap
Rules of the form X → <yu X1 you X2, have X2 with X1>
can include aligned gaps
can include informative non-terminal categories (NN, NP, VP etc)
[Page 43]
SCFG generation
Synchronous grammars generate parallel texts
Further:
applied to one text, can generate the other text
leverage efficient monolingual parsing algorithms
[Figure: the rule X → <yu X1 you X2, have X2 with X1> applied with X1 = <Aozhou, Australia> and X2 = <bangjiao, diplomatic relations>]
[Page 44]
SCFG extraction from bitexts
Step 1: identify aligned phrase-pairs
Step 2: “subtract” out subsumed phrase-pairs
[Page 45]
Example grammar
X → <yu X1 you X2, have X2 with X1>
X → <Aozhou, Australia>
X → <bangjiao, diplomatic relations>
S → <X, X>
[Page 46]
Decoding as parsing
Consider only the foreign side of grammar
X → yu X you X
X → Aozhou
X → bangjiao
S → X
Step 1: parse input text
[Page 47]
Step 2: translate
[Figure: the input-side parse of “yu Aozhou you bangjiao” with each production replaced by its output side, yielding “have diplomatic relations with Australia”]
Traverse tree, replacing each input production with its highest scoring output side
[Page 48]
Chart parsing
Input: yu Aozhou you bangjiao (positions 0–4)
1. length = 1: X → Aozhou gives X1,2; X → bangjiao gives X3,4
2. length = 2: X → yu X gives X0,2; X → you X gives X2,4; S → X gives S0,2
4. length = 4: S → S X gives S0,4; X → yu X you X gives X0,4
Two derivations yield S0,4; take the one with the maximum score
[Page 49]
Chart parsing for decoding
• starting at full sentence S0,J
• traverse down to find maximum score derivation
• translate each rule using the maximum scoring right-hand side
• emit the output string
[Figure: the chart from the previous slide]
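The parse-then-translate procedure above can be sketched as a Viterbi CYK over the input sides of the rules, assuming the grammar has been binarised so that lexical rules are purely terminal and the rest purely binary (the grammar encoding here is illustrative, not Moses/cdec syntax):

```python
from collections import defaultdict

def cyk_decode(words, lexical, binary):
    """lexical: word -> [(lhs, target, score)]
    binary: (B, C) -> [(lhs, template, score)], where the target template
    refers to the children's translations as {0} and {1}."""
    n = len(words)
    best = defaultdict(dict)  # (i, j) -> lhs -> (score, translation)
    for i, w in enumerate(words):                     # length-1 spans
        for lhs, tgt, s in lexical.get(w, []):
            cell = best[(i, i + 1)]
            if lhs not in cell or s > cell[lhs][0]:
                cell[lhs] = (s, tgt)
    for length in range(2, n + 1):                    # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                 # split point
                for B, (sb, tb) in list(best[(i, k)].items()):
                    for C, (sc, tc) in list(best[(k, j)].items()):
                        for lhs, tmpl, s in binary.get((B, C), []):
                            total = s + sb + sc
                            cell = best[(i, j)]
                            if lhs not in cell or total > cell[lhs][0]:
                                cell[lhs] = (total, tmpl.format(tb, tc))
    return best[(0, n)].get("S")  # (score, translation), or None if no parse
```

Reordering rules such as <yu X1 you X2, have X2 with X1> appear here as templates that swap {0} and {1} in the output.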
[Page 50]
LM intersection
Very efficient
cost of parsing, i.e., O(n^3)
reduces to linear if we impose a maximum span limit
the translation step is a simple O(n) post-processing step
But what about the language model?
CYK assumes model scores decompose with the tree structure
but the language model must span constituents
Problem: LM doesn’t factorise!
[Page 51]
LM intersection via lexicalised NTs
Encode LM context in NT categories (Bar-Hillel et al, 1964): X → <yu X1 you X2, have X2 with X1> becomes
haveXb → <yu aXb1 you cXd2, have cXd2 with aXb1>
where a non-terminal aXb is decorated with the left (a) and right (b) m-1 words of its output translation
When used in parent rule, LM can access boundary words
score now factorises with tree
[Page 52]
LM intersection via lexicalised NTs
[Figure: the example grammar rewritten with lexicalised non-terminals. Each production contributes its translation score φTM; where boundary words meet, language model terms are added, e.g. φTM + ψ(with → c) + ψ(d → has) + ψ(has → a) for the reordering rule, and φTM + ψ(<S> → a) + ψ(b → </S>) at the sentence boundary.]
[Page 53]
+LM Decoding
Same algorithm as before
Viterbi parse with input side grammar (CYK)
for each production, find best scoring output side
read off output string
But the input grammar has blown up
number of non-terminals is O(T^(2(m-1)))
overall translation complexity of O(n^3 T^(4(m-1)))
Terrible!
[Page 54]
Beam search and pruning
Resort to beam search
prune poor entries from chart cells during CYK parsing
histogram, threshold as in phrase-based MT
rarely have sufficient context for LM evaluation
Cube pruning
uses a lower-order LM estimate as a search heuristic
follows an approximate ‘best first’ order for incorporating child spans into the parent rule
stops once beam is full
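The heart of cube pruning can be sketched as a lazy best-first merge of two sorted child k-best lists through a rule. The +LM rescoring that makes the order only approximately ‘best first’ is omitted here, and the simple string concatenation stands in for applying the rule's target side:

```python
import heapq

def cube_prune(left, right, rule_score, beam_size):
    """left, right: child k-best lists of (score, string), sorted descending.
    Pops combinations best-first from a heap until the beam is full."""
    heap = [(-(rule_score + left[0][0] + right[0][0]), 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < beam_size:
        neg, i, j = heapq.heappop(heap)               # best remaining combination
        out.append((-neg, left[i][1] + " " + right[j][1]))
        for ni, nj in ((i + 1, j), (i, j + 1)):       # push the two neighbours
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(
                    heap, (-(rule_score + left[ni][0] + right[nj][0]), ni, nj))
    return out
```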
For more details, see
Chiang, “Hierarchical phrase-based translation”. 2007. Computational Linguistics 33(2):201–228.
[Page 55]
Further work
Synchronous grammar systems
SAMT (Venugopal & Zollmann, 2006)
ISI’s syntax system (Marcu et al., 2006)
HRGG (Chiang et al., 2013)
Tree to string (Liu, Liu & Lin, 2006)
Probabilistic grammar induction
Blunsom & Cohn (2009)
Decoding and pruning
cube growing (Huang & Chiang, 2007)
left to right decoding (Huang & Mi, 2010)
[Page 56]
Summary
What we covered
word based translation and alignment
linear phrase-based and grammar-based models
phrase-based (finite state) decoding
synchronous grammar decoding
What we didn’t cover
rule extraction process
discriminative training
tree based models
domain adaptation
OOV translation
…