Towards Syntactically Constrained Statistical Word Alignment
Greg Hanneman
11-734: Advanced Machine Translation Seminar
April 30, 2008
Outline
• The word alignment problem
• Base approaches
• Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
• Discussion
Word Alignment
• Parallel sentence pair: F and E
• Most general: map a subset of F to a subset of E
Word Alignment
• Very large alignment spaces!
– An n-word parallel sentence pair has n² possible links and 2^(n²) possible alignments
– Restrict to one-to-one alignments: n! possible alignments
• Alignment models try to restrict or learn a probability distribution over this space to get the “best” alignment of a sentence
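To make the combinatorics concrete, a quick sketch of these counts (toy arithmetic only; `alignment_space_sizes` is an illustrative helper, not from the slides):

```python
import math

def alignment_space_sizes(n):
    """Sizes of the alignment spaces for an n-word parallel sentence pair:
    n*n possible links, 2**(n*n) unrestricted alignments (each link
    independently on or off), and n! alignments under a one-to-one
    restriction with reordering."""
    links = n * n
    return links, 2 ** links, math.factorial(n)

for n in (5, 10, 20):
    links, unrestricted, one_to_one = alignment_space_sizes(n)
    print(f"n={n}: {links} links, 2^{links} alignments, {one_to_one} one-to-one")
```

Even at n=10 the unrestricted space has 2^100 alignments, which is why models must restrict or factor the distribution rather than enumerate it.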
![Page 5: Towards Syntactically Constrained Statistical Word Alignment](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e54550346895dbbe634/html5/thumbnails/5.jpg)
Outline
• The word alignment problem
• Base approaches
• Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
• Discussion
A Generative Story [Brown et al. 1990]
English sentence: The proposal will not be implemented
Fertility and lexical generation: Les propositions ne seront pas application mises en
Distortion: Les propositions ne seront pas mises en application
The Framework
• F: words f1 … fj … fn
• E: words e1 … ei … em
• Compute P(F, A | E) for hidden alignment variable A: a1 … aj … an
– The major step: decomposition, model parameters, EM algorithm, etc.
• aj = i: word fj is aligned to word ei
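The slides leave the decomposition abstract. As the simplest concrete instance, IBM Model 1 (next slide) drops fertility and distortion entirely, so P(F, A | E) reduces to a product of lexical probabilities t(f_j | e_{a_j}) and the EM E-step has a closed form. A minimal, toy-scale sketch (no NULL word, no smoothing):

```python
from collections import defaultdict

def train_model1(bitext, iterations=10):
    """Minimal IBM Model 1 EM sketch. bitext is a list of (F_words, E_words)
    pairs; returns translation probabilities t(f | e) keyed by (f, e)."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f, e)
        total = defaultdict(float)                # expected counts c(e)
        for fs, es in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)    # sum over alignments of f
                for e in es:
                    p = t[(f, e)] / z             # posterior P(a_j = i | F, E)
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]              # M-step: renormalize
    return t
```

On the classic two-pair example ("la maison"/"the house", "la fleur"/"the flower"), EM gradually resolves the ambiguity and concentrates t(la | the) near 1.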
The IBM Models [Brown et al. 1993; Och and Ney 2003]
• Model 1: “Bag of words” — word order doesn’t affect alignment
• Model 2: Position of words being aligned does matter
The IBM Models [Brown et al. 1993; Och and Ney 2003]
• Later models capture more structural and linguistic information, but only implicitly; none of them uses overt syntax
– Fertility: P(φ | ei) of ei producing φ words in F
– Distortion: P(τ, π | E) for a set of F words τ in a permutation π
– Previous alignments: Probs. for positions in F of the different words of a fertile ei
The HMM Model [Vogel et al. 1996; Och and Ney 2003]
• Linguistic intuition: words, and their alignments, tend to clump together in clusters
• aj depends on absolute size of “jump” between it and aj–1
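One way to see the HMM distortion term: it is just a distribution over jump widths, which could be estimated by counting jumps in aligned data and normalizing. A sketch (signed jumps here for illustration; `jump_distribution` is an illustrative helper, not from the papers):

```python
from collections import Counter

def jump_distribution(alignments):
    """Relative-frequency estimate of an HMM-style distortion table: for each
    consecutive pair of source positions, count the jump a_j - a_(j-1)
    between their aligned target positions, then normalize."""
    jumps = Counter(cur - prev
                    for a in alignments
                    for prev, cur in zip(a, a[1:]))
    total = sum(jumps.values())
    return {jump: c / total for jump, c in jumps.items()}
```

For example, `jump_distribution([[1, 2, 3, 5, 4]])` gives `{1: 0.5, 2: 0.25, -1: 0.25}`: mostly-monotone alignments concentrate mass on jump +1, which is exactly the clumping intuition above.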
Discriminative Training
• Consider all possible alignments, score them, and pick the best ones under some set of constraints
• Can incorporate arbitrary features; generative models more fixed
• Generative models’ EM requires lots of unlabeled training data; discriminative requires some labeled data
Discriminative Alignment [Taskar et al. 2005]
• Score each candidate link (ei, fj) as w · v(ei, fj), a weighted vector of features:
– Co-occurrence
– Position difference
– Co-occurrence of following words
– Word-frequency rank
– Model 4 prediction
– …
[Figure: candidate links between “The proposal will not be implemented” and “Les propositions ne seront pas mises en application”]
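Under the one-to-one restriction, this setup amounts to a maximum-weight bipartite matching over the link scores w · v(ei, fj). A brute-force sketch for toy sentence lengths (the score matrix is hypothetical; real systems solve the matching with the Hungarian algorithm or an LP rather than enumeration):

```python
from itertools import permutations

def best_matching(score):
    """Exhaustive maximum-weight bipartite matching for equal-length toy
    sentences. score[i][j] is the learned link score w . v(e_i, f_j);
    returns (total_score, assignment) where assignment[i] = j."""
    n = len(score)
    return max(((sum(score[i][perm[i]] for i in range(n)), perm)
                for perm in permutations(range(n))),
               key=lambda pair: pair[0])
```

For instance, `best_matching([[3, 1], [1, 3]])` returns `(6, (0, 1))`: each word takes its high-scoring partner.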
Outline
• The word alignment problem
• Base approaches
• Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
• Discussion
Syntax-Based Approaches
• Constrain alignment space by looking beyond flat text stream: take higher-level sentence structure into account
• Representations
– Constituency structure
– Inversion Transduction Grammar
– Dependency structure
An MT Motivation
Syntax-Based Distortion [DeNero and Klein 2007]
• Syntax-based MT should start from syntax-aware word alignments
• HMM model + target-language parse trees: prefer alignments that respect tree
• Handled in distortion model: jumps should reflect tree structure
Syntax-Based Distortion [DeNero and Klein 2007]
• HMM distortion: size of jump between aj–1 and aj
• Syntactic distortion: tree path between aj–1 and aj
Syntax-Based Distortion [DeNero and Klein 2007]
• Training: 100,000 parallel French–English and Chinese–English sentences with English parse trees
• Both E→F and F→E; combined with different unions and intersections, plus thresholds
• Test: Hand-aligned Hansards and NIST MT 2002 data
Syntax-Based Distortion [DeNero and Klein 2007]
• The two HMM models roughly equal, both better than GIZA++
• Best settings: soft union for French, hard union for Chinese, with competitive thresholding
Tree-to-String Models
Tree-to-String Models
• New generative story
• Word-level fertility and distortion replaced with node insertion and sibling reordering
• Lexical translation still the same
• Word alignment produced as a side effect from lexical translations
Tree-to-String Alignment [Yamada and Knight 2001]
• Discussed in other sessions this semester
• Training: 2121 short Japanese–English sentences, modified Collins parser output for English
• Test: First 50 sentences of training corpus
• Beat IBM Model 5 on human judgements; perplexity between Model 1 and Model 5
Subtree Cloning [Gildea 2003]
• Original tree-to-string model is too strict
– Syntactic divergences, reordering
• Soft constraint: allow alignments that violate tree structure, but at a cost
– Tweak the tree side of the alignment to contain things needed for the string side
– Ex.: SVO to OSV
Subtree Cloning [Gildea 2003]
[Figure: parse tree of “I do not entirely understand your language”, with the NP–PRP subtree for “I” shown as a candidate for cloning]
Subtree Cloning [Gildea 2003]
[Figure: the same parse tree after cloning, with a copy of the NP–PRP subtree for “I” inserted under the VP]
Subtree Cloning [Gildea 2003]
[Figure: the cloned tree word-aligned to the foreign sentence; some foreign words align to NULL]
Subtree Cloning [Gildea 2003]
• For a node np:
– Probability of cloning something as a new child of np: single EM-learned constant for all np
– Probability of making that clone a node nc: uniform over all nc
• Surprising that this works…
Subtree Cloning [Gildea 2003]
• Compared with IBM 1–3, basic tree-to-string, basic tree-to-tree models
• Training: 4982 Korean–English sentence pairs, with manual Korean parse trees
• Test: 101 hand-aligned held-out sentences
Subtree Cloning [Gildea 2003]
• Cloning helps: as good or better than IBM
• Tree-to-tree model runs faster
Tree-to-Tree Models
• Alignment must conform to tree structure on both sides — space is more constrained
• Requires more transformation operations to handle divergent structures [Gildea 2003]
• Or we could be more permissive…
Inversion Transduction Grammar [Wu 1997]
• For bilingual parsing; get one-to-one word alignment as a side effect
• Parallel binary-branching trees with reordering
ITG Operations
• A → [A A]
– Produce “A1 A2” in source and target streams
• A → <A A>
– Produce “A1 A2” in source stream, “A2 A1” in target stream
• A → e / f
– Produce “e” in source stream, “f” in target stream
ITG Operations
• “Canonical form” ITG produces only one derivation for a given alignment
– S → A | B | C
– A → [A B] | [B B] | [C B] | [A C] | [B C] | [C C]
– B → <A A> | <B A> | <C A> | <A C> | <B C> | <C C>
– C → e / f
Alignment with ITG [Zhang and Gildea 2004]
• Compared IBM 1, IBM 4, ITG, and tree-to-string (with and without cloning)
• Training: Chinese–English (18,773) and French–English (20,000) sentences less than 25 words long
• Test: Hand-aligned Chinese–English (48) and French–English (447)
Alignment with ITG [Zhang and Gildea 2004]
• ITG best, or at least as good as IBM or tree-to-string plus cloning
• ITG has no linguistic syntax…
Dependency Parsing
• Discussed in other sessions this semester
• Notion of violating “phrasal cohesion”
– Usually bad, but not always
Dependencies + ITG [Cherry and Lin 2006]
• Find invalid dependency spans; assign score of –∞ if they are used by the ITG parser
• Simple model: maximize the total link score Σᵢ Σⱼ v(ei, fj), where v(ei, fj) = φ²(ei, fj) − 10⁻⁵ · (penalty for distant positions)
• ITG reduces AER by 13% relative; dependencies + ITG reduce by 34%
Dependencies + ITG [Cherry and Lin 2006]
• Discriminative training with an SVM
• Feature vector for each ITG rule instance
– Features from Taskar et al. [2005]
– Feature marking ITG inversion rules
– Feature (penalty) marking invalid spans based on dependency tree
Dependencies + ITG [Cherry and Lin 2006]
• Compared Taskar et al. to D-ITG with hard and soft constraints
• Training: 50,000 French–English sentence pairs for counts and probabilities; 100 hand-annotated pairs with derived ITG trees for discriminative training
• Test: 347 hand-annotated sentences from 2003 parallel text workshop
Dependencies + ITG [Cherry and Lin 2006]
• Relative improvement smaller in discriminative training scenario with stronger objective function
• Hard constraint starts to hurt recall
Outline
• The word alignment problem
• Base approaches
• Syntax-based approaches
– Distortion models
– Tree-to-string models
– Tree-to-tree models
• Discussion
All These Tradeoffs…
• Mathematical and statistical correctness vs. computability
• Simple model vs. capturing linguistic phenomena
• Not enough syntactic information vs. too much syntactic information
• Ruling out bad alignments vs. keeping good alignments around
Alignment Spaces
• Completely unconstrained: every alignment link (ei, fj) either “on” or “off”
• Permutation space: one-to-one alignment with reordering [Taskar et al. 2005]
• ITG space: permutation space satisfying binary tree constraint [Wu 1997]
• Dependency space: permutation space maintaining phrasal cohesion
Alignment Spaces
• D-ITG space: Dependency ∩ ITG space [Cherry and Lin 2006]
• HD-ITG space: D-ITG space where each span must contain a head [Cherry and Lin 2006a]
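The ITG space can be characterized directly: a permutation of source words is reachable iff it can be built from single words using only straight [A A] and inverted <A A> binary combinations. A sketch that tests this by recursively splitting the source span into two target-contiguous halves:

```python
from functools import lru_cache
from itertools import permutations

def itg_parsable(perm):
    """True iff the permutation can be generated by an ITG, i.e. built
    bottom-up with straight [A A] and inverted <A A> rules only."""
    perm = tuple(perm)

    @lru_cache(maxsize=None)
    def ok(i, j):                     # can source span perm[i:j] be derived?
        if j - i == 1:
            return True
        for k in range(i + 1, j):
            left, right = perm[i:k], perm[k:j]
            # Each half must cover a contiguous block of target positions;
            # the two blocks then combine in straight or inverted order.
            if (max(left) - min(left) + 1 == len(left)
                    and max(right) - min(right) + 1 == len(right)
                    and ok(i, k) and ok(k, j)):
                return True
        return False

    return ok(0, len(perm))

# Of the 24 permutations of four words, only the two "inside-out" orders
# 2413 and 3142 (1-indexed) fall outside the ITG space.
non_itg = [p for p in permutations(range(4)) if not itg_parsable(p)]
```

This makes concrete the “rules out almost nothing” finding below: for length 4, the ITG constraint excludes only 2 of 24 permutations.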
Examining Alignment Spaces [Cherry and Lin 2006a]
• Alignment score
– Learned co-occurrence score
– Gold-standard oracle score
Examining Alignment Spaces [Cherry and Lin 2006a]
• Learned co-occurrence score
– More restricted spaces give better results
Examining Alignment Spaces [Cherry and Lin 2006a]
• Oracle score: subsets of permutation space
– ITG rules out almost nothing correct
– Beam search in dependency space does worst
Conclusions
• Base alignment models are purely mathematical, with only limited notions of sentence structure
• Syntax-aware alignment helpful for syntax-aware MT [DeNero and Klein 2007]
• Using structure as a hard constraint is harmful for divergent sentences; tweaking trees [Gildea 2003] or using soft constraints [Cherry and Lin 2006] helps fix this
Conclusions
• Surprise winner: ITG
– Computationally straightforward
– Permissive, simple grammar that mostly only rules out bad alignments [Cherry and Lin 2006a]
– Does a lot, even when it’s not the best
• Discriminative framework looks promising and flexible — can incorporate generative models as features [Taskar et al. 2005]
Towards the Future
• Easy-to-run GIZA++ made complicated IBM models the norm — promising discriminative or syntax-based models currently lack such a toolkit
• Syntax-based discriminative techniques — morphology, POS, semantic information…
• Any other ideas?
References
• Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin, “A statistical approach to machine translation,” Computational Linguistics, 16(2):79-85, 1990.
• Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer, “The mathematics of statistical machine translation: Parameter estimation,” Computational Linguistics, 19(2):263-311, 1993.
• Cherry, Colin and Dekang Lin, “Soft syntactic constraints for word alignment through discriminative training,” Proceedings of the COLING/ACL Poster Session, 105-112, 2006.
• Cherry, Colin and Dekang Lin, “A comparison of syntactically motivated alignment spaces,” Proceedings of EACL, 145-152, 2006a.
• DeNero, John and Dan Klein, “Tailoring word alignments to syntactic machine translation,” Proceedings of ACL, 17-24, 2007.
• Gildea, Daniel, “Loosely tree-based alignment for machine translation,” Proceedings of ACL, 80-87, 2003.
• Och, Franz and Hermann Ney, “A systematic comparison of various statistical alignment models,” Computational Linguistics, 29(1):19-51, 2003.
• Taskar, B., S. Lacoste-Julien, and D. Klein, “A discriminative matching approach to word alignment,” Proceedings of HLT/EMNLP, 73-80, 2005.
• Vogel, S., H. Ney, and C. Tillmann, “HMM-based word alignment in statistical translation,” Proceedings of COLING, 836-841, 1996.
• Wu, Dekai, “Stochastic inversion transduction grammars and bilingual parsing of parallel corpora,” Computational Linguistics, 23(3):377-403, 1997.
• Yamada, Kenji and Kevin Knight, “A syntax-based statistical translation model,” Proceedings of ACL, 523-530, 2001.
• Zhang, Hao and Daniel Gildea, “Syntax-based alignment: Supervised or unsupervised?” Proceedings of COLING, 418-424, 2004.