Dual Decomposition for Natural Language Processing
TRANSCRIPT
Source: people.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf
Alexander M. Rush and Michael Collins
Decoding complexity
focus: decoding problem for natural language tasks
y∗ = arg max_{y} f(y)
motivation:
• richer model structure often leads to improved accuracy
• exact decoding for complex models tends to be intractable
Decoding tasks

many common problems are intractable to decode exactly
high complexity
• combined parsing and part-of-speech tagging (Rush et al., 2010)
• “loopy” HMM part-of-speech tagging
• syntactic machine translation (Rush and Collins, 2011)
NP-Hard
• symmetric HMM alignment (DeNero and Macherey, 2011)
• phrase-based translation (Chang and Collins, 2011)
• higher-order non-projective dependency parsing (Koo et al., 2010)
in practice:
• approximate decoding methods (coarse-to-fine, beam search, cube pruning, Gibbs sampling, belief propagation)
• approximate models (mean field, variational models)
Motivation
cannot hope to find exact algorithms (particularly when NP-Hard)
aim: develop decoding algorithms with formal guarantees
method:
• derive fast algorithms that provide certificates of optimality
• show that for practical instances, these algorithms often yield exact solutions
• provide strategies for improving solutions or finding approximate solutions when no certificate is found
dual decomposition helps us develop algorithms of this form
Lagrangian relaxation (Held and Karp, 1971)
important method from combinatorial optimization
initially used for traveling salesman problems
[Figure: two copies of a 7-node graph — the optimal tour (NP-Hard) and the optimal 1-tree (easy, computed as an MST)]
Dual decomposition (Komodakis et al., 2010; Lemarechal, 2001)
goal: solve complicated optimization problem
y∗ = arg max_{y} f(y)
method: decompose into subproblems, solve iteratively
benefit: can choose decomposition to provide “easy” subproblems
aim for simple and efficient combinatorial algorithms
• dynamic programming
• minimum spanning tree
• shortest path
• min-cut
• bipartite match
• etc.
Related work
there are related methods used in NLP with similar motivation

related methods:
• belief propagation (particularly max-product) (Smith and Eisner, 2008)
• factored A* search (Klein and Manning, 2003)
• exact coarse-to-fine (Raphael, 2001)
aim to find exact solutions without exploring the full search space
Tutorial outline
focus:
• developing dual decomposition algorithms for new NLP tasks
• understanding formal guarantees of the algorithms
• extensions to improve exactness and select solutions
outline:
1. worked algorithm for combined parsing and tagging
2. important theorems and formal derivation
3. more examples from parsing, sequence labeling, MT
4. practical considerations for implementing dual decomposition
5. relationship to linear programming relaxations
6. further variations and advanced examples
1. Worked example
aim: walk through a dual decomposition algorithm for combined parsing and part-of-speech tagging
• introduce formal notation for parsing and tagging
• give assumptions necessary for decoding
• step through a run of the dual decomposition algorithm
Combined parsing and part-of-speech tagging

[Parse tree: (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))]

goal: find the parse tree that optimizes
  score(S → NP VP) + score(VP → V NP) + … + score(N → United) + …
Constituency parsing

notation:
• Y is the set of constituency parses for the input
• y ∈ Y is a valid parse
• f(y) scores a parse tree

goal: arg max_{y∈Y} f(y)

example: a context-free grammar for constituency parsing

[Parse tree: (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))]
Part-of-speech tagging

notation:
• Z is the set of tag sequences for the input
• z ∈ Z is a valid tag sequence
• g(z) scores a tag sequence

goal: arg max_{z∈Z} g(z)

example: an HMM for part-of-speech tagging

United1/N flies2/V some3/D large4/A jet5/N
Identifying tags

notation: identify the tag labels selected by each model
• y(i, t) = 1 when parse y selects tag t at position i
• z(i, t) = 1 when tag sequence z selects tag t at position i

example: a parse and tagging with y(4, A) = 1 and z(4, A) = 1

y: [Parse tree: (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))]
z: United1/N flies2/V some3/D large4/A jet5/N
Combined optimization

goal:
  arg max_{y∈Y, z∈Z} f(y) + g(z)
such that for all i = 1 … n, t ∈ T,
  y(i, t) = z(i, t)

i.e. find the best parse and tagging pair that agree on tag labels

equivalent formulation:
  arg max_{y∈Y} f(y) + g(l(y))
where l : Y → Z extracts the tag sequence from a parse tree
Dynamic programming intersection

can solve by decoding the product of the two models

example:
• parsing model is a context-free grammar
• tagging model is a first-order HMM
• can solve as a CFG and finite-state automaton intersection

replace S → NP VP with S_{N,N} → NP_{N,N} VP_{V,N}

[Parse tree: (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))]
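The blow-up from intersection can be made concrete. Under the annotation scheme hinted at above — each nonterminal carries the first and last HMM tag of its span (an assumption for this sketch; the exact construction varies) — every binary rule is copied once per choice of boundary tags:

```python
from itertools import product

def annotated_rules(rule, tags):
    """Hypothetical CFG x HMM product construction: annotate each
    nonterminal X with the first and last tag of its span, so a binary
    rule X -> Y Z becomes X_{a,d} -> Y_{a,b} Z_{c,d} for every choice of
    tags (the HMM transition score T(b -> c) would attach to each copy)."""
    X, Y, Z = rule
    return [((X, a, d), (Y, a, b), (Z, c, d))
            for a, b, c, d in product(tags, repeat=4)]

rules = annotated_rules(('S', 'NP', 'VP'), ['N', 'V', 'D', 'A'])
# one original rule becomes |T|**4 = 256 annotated copies,
# including the slide's example S_{N,N} -> NP_{N,N} VP_{V,N}
```

This factor-of-|T|⁴ replication is why exact intersection, while polynomial, is often too slow in practice — which motivates decomposing instead.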
Parsing assumption

the structure of Y could be CFG, TAG, etc.

assumption: optimization with u can be solved efficiently
  arg max_{y∈Y} f(y) + ∑_{i,t} u(i, t) y(i, t)

generally benign since u can be incorporated into the structure of f

example: CFG with rule scoring function h
  f(y) = ∑_{X→Y Z ∈ y} h(X → Y Z) + ∑_{(i,X) ∈ y} h(X → w_i)
where
  arg max_{y∈Y} f(y) + ∑_{i,t} u(i, t) y(i, t)
    = arg max_{y∈Y} ∑_{X→Y Z ∈ y} h(X → Y Z) + ∑_{(i,X) ∈ y} (h(X → w_i) + u(i, X))
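Concretely, "incorporating u into f" for the CFG example just adds u(i, X) to each terminal rule score before running the unmodified parser. A minimal sketch, assuming hypothetical dictionary-based terminal scores:

```python
def fold_penalties(h_term, u):
    """h_term maps (i, X) -> h(X -> w_i); u maps (i, t) -> penalty.
    Returns adjusted scores h(X -> w_i) + u(i, X): running a plain CKY
    decoder on these scores solves the penalized parsing subproblem."""
    return {(i, X): s + u.get((i, X), 0.0) for (i, X), s in h_term.items()}

h_term = {(1, 'N'): 0.5, (1, 'A'): 0.7}   # toy terminal scores (made up)
u      = {(1, 'A'): -1.0, (1, 'N'): 1.0}  # toy penalties (made up)
adjusted = fold_penalties(h_term, u)
# (1, 'N') now outscores (1, 'A'): 1.5 vs. roughly -0.3
```

Since only the scores change, the parser itself needs no modification — the decomposition reuses existing decoders as-is.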
Tagging assumption

we make a similar assumption for the set Z

assumption: optimization with u can be solved efficiently
  arg max_{z∈Z} g(z) − ∑_{i,t} u(i, t) z(i, t)

example: HMM with scores for transitions T and observations O
  g(z) = ∑_{t→t′ ∈ z} T(t → t′) + ∑_{(i,t) ∈ z} O(t → w_i)
where
  arg max_{z∈Z} g(z) − ∑_{i,t} u(i, t) z(i, t)
    = arg max_{z∈Z} ∑_{t→t′ ∈ z} T(t → t′) + ∑_{(i,t) ∈ z} (O(t → w_i) − u(i, t))
Dual decomposition algorithm

Set u^(1)(i, t) = 0 for all i, t ∈ T

For k = 1 to K
  y^(k) ← arg max_{y∈Y} f(y) + ∑_{i,t} u^(k)(i, t) y(i, t)   [Parsing]
  z^(k) ← arg max_{z∈Z} g(z) − ∑_{i,t} u^(k)(i, t) z(i, t)   [Tagging]
  If y^(k)(i, t) = z^(k)(i, t) for all i, t, Return (y^(k), z^(k))
  Else u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))
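The loop above is easy to implement generically. In this sketch the two subproblem solvers are passed in as functions (hypothetical stand-ins for the CKY and Viterbi decoders); each takes the penalty vector u and returns the set of (i, tag) pairs its argmax solution selects:

```python
from collections import defaultdict

def dual_decompose(parse_oracle, tag_oracle, max_iters=50, step=1.0):
    """Generic dual decomposition loop over two agreement-coupled models."""
    u = defaultdict(float)                   # u^(1)(i, t) = 0 for all i, t
    y = z = None
    for k in range(1, max_iters + 1):
        y = parse_oracle(u)                  # arg max f(y) + sum u(i,t) y(i,t)
        z = tag_oracle(u)                    # arg max g(z) - sum u(i,t) z(i,t)
        if y == z:
            return y, z, True                # agreement: certificate of optimality
        alpha = step / k                     # rate satisfying the convergence conditions
        for key in y - z:
            u[key] -= alpha                  # penalize tags only the parser chose
        for key in z - y:
            u[key] += alpha                  # reward tags only the tagger chose
    return y, z, False                       # no certificate within K iterations
```

With toy oracles that initially disagree on one position, the penalties push the parser off its preferred choice and the loop returns with a certificate after a couple of iterations, mirroring the worked run on the next slides.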
CKY Parsing and Viterbi Decoding: a worked run

[Figure: an animation stepping the algorithm through "United flies some large jet"; each frame shows the current CKY parse, the current Viterbi tagging, and the penalties u(i, t).]

parsing subproblem: y∗ = arg max_{y∈Y} ( f(y) + ∑_{i,t} u(i, t) y(i, t) )
tagging subproblem: z∗ = arg max_{z∈Z} ( g(z) − ∑_{i,t} u(i, t) z(i, t) )

penalties: u(i, t) = 0 for all i, t initially

Iteration 1:
  CKY parse: (S (NP (A United) (N flies) (D some) (A large)) (VP (V jet))), i.e. tags A N D A V
  Viterbi tagging: N V D A N
  the two solutions disagree at positions 1, 2, and 5; penalty updates:
  u(1, A) = −1, u(1, N) = 1, u(2, N) = −1, u(2, V) = 1, u(5, V) = −1, u(5, N) = 1

Iteration 2:
  penalties: u(5, V) = −1, u(5, N) = 1

Converged:
  y∗ = arg max_{y∈Y} f(y) + g(l(y))
  parse and tagging agree: (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet)))) with tags N V D A N

Key: f(y) ⇐ CFG, g(z) ⇐ HMM; Y ⇐ parse trees, Z ⇐ taggings; y(i, t) = 1 if y contains tag t at position i
Main theorem
theorem: if at any iteration, for all i, t ∈ T,
  y^(k)(i, t) = z^(k)(i, t)
then (y^(k), z^(k)) is the global optimum
proof: focus of the next section
Convergence

[Figure: bar chart of % of examples converged (y-axis, 0–100) against number of iterations (x-axis: ≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50)]
2. Formal properties

aim: formal derivation of the algorithm given in the previous section
• derive the Lagrangian dual
• prove three properties
  – upper bound
  – convergence
  – optimality
• describe the subgradient method
Lagrangian

goal:
  arg max_{y∈Y, z∈Z} f(y) + g(z) such that y(i, t) = z(i, t)

Lagrangian:
  L(u, y, z) = f(y) + g(z) + ∑_{i,t} u(i, t) (y(i, t) − z(i, t))

redistribute terms:
  L(u, y, z) = ( f(y) + ∑_{i,t} u(i, t) y(i, t) ) + ( g(z) − ∑_{i,t} u(i, t) z(i, t) )
Lagrangian dual

Lagrangian:
  L(u, y, z) = ( f(y) + ∑_{i,t} u(i, t) y(i, t) ) + ( g(z) − ∑_{i,t} u(i, t) z(i, t) )

Lagrangian dual:
  L(u) = max_{y∈Y, z∈Z} L(u, y, z)
       = max_{y∈Y} ( f(y) + ∑_{i,t} u(i, t) y(i, t) ) + max_{z∈Z} ( g(z) − ∑_{i,t} u(i, t) z(i, t) )
Theorem 1. Upper bound

define:
• y∗, z∗ is the optimal combined parsing and tagging solution with y∗(i, t) = z∗(i, t) for all i, t

theorem: for any value of u,
  L(u) ≥ f(y∗) + g(z∗)

L(u) provides an upper bound on the score of the optimal solution

note: the upper bound may be useful as input to branch-and-bound or A* search
Theorem 1. Upper bound (proof)

theorem: for any value of u, L(u) ≥ f(y∗) + g(z∗)

proof:
  L(u) = max_{y∈Y, z∈Z} L(u, y, z)            (1)
       ≥ max_{y∈Y, z∈Z : y=z} L(u, y, z)      (2)
       = max_{y∈Y, z∈Z : y=z} f(y) + g(z)     (3)
       = f(y∗) + g(z∗)                        (4)
Formal algorithm (reminder)

Set u^(1)(i, t) = 0 for all i, t ∈ T

For k = 1 to K
  y^(k) ← arg max_{y∈Y} f(y) + ∑_{i,t} u^(k)(i, t) y(i, t)   [Parsing]
  z^(k) ← arg max_{z∈Z} g(z) − ∑_{i,t} u^(k)(i, t) z(i, t)   [Tagging]
  If y^(k)(i, t) = z^(k)(i, t) for all i, t, Return (y^(k), z^(k))
  Else u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))
Theorem 2. Convergence

notation:
• u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t)) is the update
• u^(k) is the penalty vector at iteration k
• α_k is the update rate at iteration k

theorem: for any sequence α_1, α_2, α_3, … such that
  lim_{k→∞} α_k = 0 and ∑_{k=1}^∞ α_k = ∞,
we have
  lim_{k→∞} L(u^(k)) = min_u L(u)

i.e. the algorithm converges to the tightest possible upper bound

proof: by subgradient convergence (next section)
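A standard rate satisfying both conditions is the harmonic sequence, for any constant c > 0:

```latex
\alpha_k = \frac{c}{k}, \qquad
\lim_{k \to \infty} \alpha_k = 0, \qquad
\sum_{k=1}^{\infty} \alpha_k \;=\; c \sum_{k=1}^{\infty} \frac{1}{k} \;=\; \infty .
```

By contrast, a constant rate α_k = c fails the first condition, and a geometric rate α_k = c·2^{−k} fails the second (its sum is finite).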
Dual solutions

define: for any value of u,
  y_u = arg max_{y∈Y} f(y) + ∑_{i,t} u(i, t) y(i, t)
and
  z_u = arg max_{z∈Z} g(z) − ∑_{i,t} u(i, t) z(i, t)

y_u and z_u are the dual solutions for a given u
Theorem 3. Optimality

theorem: if there exists u such that
  y_u(i, t) = z_u(i, t)
for all i, t, then
  f(y_u) + g(z_u) = f(y∗) + g(z∗)

i.e. if the dual solutions agree, we have an optimal solution (y_u, z_u)
Theorem 3. Optimality (proof)

theorem: if u is such that y_u(i, t) = z_u(i, t) for all i, t, then
  f(y_u) + g(z_u) = f(y∗) + g(z∗)

proof: by the definitions of y_u and z_u,
  L(u) = f(y_u) + g(z_u) + ∑_{i,t} u(i, t) (y_u(i, t) − z_u(i, t))
       = f(y_u) + g(z_u)

since L(u) ≥ f(y∗) + g(z∗) for all values of u,
  f(y_u) + g(z_u) ≥ f(y∗) + g(z∗)

but y∗ and z∗ are optimal, so
  f(y_u) + g(z_u) ≤ f(y∗) + g(z∗)
Dual optimization

Lagrangian dual:
  L(u) = max_{y∈Y, z∈Z} L(u, y, z)
       = max_{y∈Y} ( f(y) + ∑_{i,t} u(i, t) y(i, t) ) + max_{z∈Z} ( g(z) − ∑_{i,t} u(i, t) z(i, t) )

goal: the dual problem is to find the tightest upper bound
  min_u L(u)
Dual subgradient

  L(u) = max_{y∈Y} ( f(y) + ∑_{i,t} u(i, t) y(i, t) ) + max_{z∈Z} ( g(z) − ∑_{i,t} u(i, t) z(i, t) )

properties:
• L(u) is convex in u (no local minima)
• L(u) is not differentiable (because of the max operator)

handle non-differentiability by using subgradient descent

define: a subgradient of L(u) at u is a vector g_u such that for all v,
  L(v) ≥ L(u) + g_u · (v − u)
Subgradient algorithm

    L(u) = max_{y∈Y} ( f(y) + ∑_{i,t} u(i, t) y(i, t) ) + max_{z∈Z} ( g(z) − ∑_{i,t} u(i, t) z(i, t) )

recall: y_u and z_u are the argmaxes of the two terms

subgradient:

    g_u(i, t) = y_u(i, t) − z_u(i, t)

subgradient descent: move along the subgradient

    u′(i, t) = u(i, t) − α (y_u(i, t) − z_u(i, t))

guaranteed to find a minimum under the conditions on α given earlier
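A minimal sketch of this loop in Python, assuming hypothetical black-box solvers solve_y and solve_z for the two argmax subproblems (names and the indicator-dict interface are assumptions, not part of the tutorial):

```python
def dual_decompose(solve_y, solve_z, keys, iterations=50, alpha=1.0):
    """Subgradient descent on the Lagrangian dual L(u).

    solve_y(u): returns y_u as {(i, t): 0 or 1}, the argmax of
                f(y) + sum u(i,t) y(i,t)
    solve_z(u): returns z_u as {(i, t): 0 or 1}, the argmax of
                g(z) - sum u(i,t) z(i,t)
    keys: the (i, t) pairs the two models must agree on
    """
    u = {k: 0.0 for k in keys}              # u(i, t) = 0 for all i, t
    y = z = None
    for it in range(iterations):
        y = solve_y(u)
        z = solve_z(u)
        if all(y[k] == z[k] for k in keys):
            return y, z                     # agreement => provably optimal
        rate = alpha / (it + 1)             # diminishing step size
        for k in keys:
            # u'(i,t) = u(i,t) - rate * (y_u(i,t) - z_u(i,t))
            u[k] -= rate * (y[k] - z[k])
    return y, z                             # no agreement: return last solutions
```

Each iteration only requires re-running the two original decoders with adjusted scores, which is what makes the method practical.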
3. More examples
aim: demonstrate similar algorithms that can be applied to other decoding applications
• context-free parsing combined with dependency parsing
• corpus-level part-of-speech tagging
• combined translation alignment
Combined constituency and dependency parsing (Rush et al., 2010)

setup: assume separate models trained for constituency and dependency parsing

problem: find the constituency parse that maximizes the sum of the two models

example:

• combine a lexicalized CFG with a second-order dependency parser
Lexicalized constituency parsing

notation:
• Y is the set of lexicalized constituency parses for the input
• y ∈ Y is a valid parse
• f(y) scores a parse tree

goal:

    arg max_{y∈Y} f(y)

example: a lexicalized context-free grammar

S(flies)
  NP(United)
    N United
  VP(flies)
    V flies
    NP(jet)
      D some
      A large
      N jet
Dependency parsing
define:
• Z is set of dependency parses for input
• z ∈ Z is a valid dependency parse
• g(z) scores a dependency parse
example:
*0 United1 flies2 some3 large4 jet5
Identifying dependencies

notation: identify the dependencies selected by each model
• y(i, j) = 1 when word i modifies word j in constituency parse y
• z(i, j) = 1 when word i modifies word j in dependency parse z

example: a constituency and dependency parse with y(3, 5) = 1 and z(3, 5) = 1

y:  S(flies)
      NP(United)
        N United
      VP(flies)
        V flies
        NP(jet)
          D some
          A large
          N jet

z:  *0 United1 flies2 some3 large4 jet5
Combined optimization

goal:

    arg max_{y∈Y, z∈Z} f(y) + g(z)

such that for all i = 1 . . . n, j = 0 . . . n,

    y(i, j) = z(i, j)
CKY Parsing

early iterations: y* attaches "some" outside NP(jet)

S(flies)
  NP
    N United
  VP(flies)
    V flies
    D some
    NP(jet)
      A large
      N jet

after convergence: y* uses the full noun phrase

S(flies)
  NP
    N United
  VP(flies)
    V flies
    NP(jet)
      D some
      A large
      N jet

    y* = arg max_{y∈Y} ( f(y) + ∑_{i,j} u(i, j) y(i, j) )

Dependency Parsing

*0 United1 flies2 some3 large4 jet5

    z* = arg max_{z∈Z} ( g(z) − ∑_{i,j} u(i, j) z(i, j) )

Penalties: u(i, j) = 0 for all i, j

Iteration 1:  u(2, 3) −1,  u(5, 3) 1

Converged:  y* = arg max_{y∈Y} f(y) + g(y)

Key:
f(y) ⇐ CFG             g(z) ⇐ Dependency Model
Y ⇐ Parse Trees        Z ⇐ Dependency Trees
y(i, j) = 1 if y contains dependency (i, j)
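The practical point in this example: the penalties u(i, j) enter each subproblem only through the arc scores, so CKY and the dependency decoder run completely unchanged. A minimal sketch of folding the penalties into a score table (helper name and dict format are assumptions):

```python
def penalized_arc_scores(arc_scores, u, sign):
    """Fold dual penalties into a model's per-arc scores.

    arc_scores: {(i, j): score} from the underlying model
    u:          {(i, j): penalty} dual variables
    sign:       +1 for the y-subproblem  (maximize f(y) + sum u y)
                -1 for the z-subproblem  (maximize g(z) - sum u z)
    """
    return {arc: s + sign * u.get(arc, 0.0) for arc, s in arc_scores.items()}
```

Each decoder then simply maximizes its own objective over the adjusted scores at every iteration.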
Convergence

[plot: % of examples converged (0–100) vs. number of iterations (≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50)]
Integrated Constituency and Dependency Parsing: Accuracy

[bar chart: F1 score, range 87–92, for Collins, Dep, Dual]

• Collins (1997) Model 1
• Fixed, first-best dependencies from Koo (2008)
• Dual decomposition
Corpus-level tagging

setup: given a corpus of sentences and a trained sentence-level tagging model

problem: find the best tagging for each sentence, while at the same time enforcing inter-sentence soft constraints

example:

• test-time decoding with a trigram tagger

• constraint that each word type prefer a single POS tag
Corpus-level tagging

English is my first language

He studies language arts now

Language makes us human beings

[figure: the occurrences of "language" across sentences are linked to a shared tag node N]
Sentence-level decoding

notation:
• Yi is the set of tag sequences for input sentence i
• Y = Y1 × . . . × Ym is the set of tag sequences for the input corpus
• Y ∈ Y is a valid tag sequence for the corpus
• F(Y) = ∑_i f(Yi) is the score for tagging the whole corpus

goal:

    arg max_{Y∈Y} F(Y)

example: decode each sentence with a trigram tagger

English/N is/V my/P first/A language/N
He/P studies/V language/A arts/N now/R
Inter-sentence constraints

notation:
• Z is the set of possible assignments of tags to word types
• z ∈ Z is a valid tag assignment
• g(z) is a scoring function for assignments to word types

example: an MRF model that encourages words of the same type to choose the same tag

z1:  language/N  language/N  language/N
z2:  language/N  language/A  language/N

g(z1) > g(z2)
Identifying word tags

notation: identify the tag labels selected by each model
• Ys(i, t) = 1 when the tagger for sentence s at position i selects tag t
• z(s, i, t) = 1 when the constraint assigns at sentence s position i the tag t

example: a tagging and a constraint assignment with Y1(5, N) = 1 and z(1, 5, N) = 1

Y:  English is my first language
    He studies language arts now

z:  language  language  language
Combined optimization

goal:

    arg max_{Y∈Y, z∈Z} F(Y) + g(z)

such that for all s = 1 . . . m, i = 1 . . . n, t ∈ T,

    Ys(i, t) = z(s, i, t)
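Under a hard version of the agreement constraint (every occurrence of a word type takes the same tag), the z-subproblem decomposes over word types and is solvable exactly per type. A sketch under that assumption (all names and the per-type score table are hypothetical):

```python
def solve_mrf(occurrences, tags, u, type_score):
    """z-subproblem for a hard 'one tag per word type' MRF.

    occurrences: {word_type: [(s, i), ...]} positions in the corpus
    tags:        candidate tag set T
    u:           {(s, i, t): penalty} dual variables (0.0 if absent)
    type_score:  {(word_type, t): score}, the g component per type
    Returns z as {(s, i, t): 0 or 1}.
    """
    z = {}
    for w, positions in occurrences.items():
        # score of assigning tag t to every occurrence of word type w,
        # minus the accumulated dual penalties at those positions
        def score(t):
            return type_score.get((w, t), 0.0) - sum(
                u.get((s, i, t), 0.0) for (s, i) in positions)
        best = max(tags, key=score)
        for (s, i) in positions:
            for t in tags:
                z[(s, i, t)] = int(t == best)
    return z
```

Because each word type is handled independently, this subproblem stays cheap even at corpus scale.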
Tagging

early iterations:

English/N is/V my/P first/A language/N
He/P studies/V language/A arts/N now/R
Language/N makes/V us/P human/N beings/N

after convergence:

English/N is/V my/P first/A language/N
He/P studies/V language/N arts/N now/R
Language/N makes/V us/P human/N beings/N

MRF:

early iterations:    language/A  language/A  language/A
after convergence:   language/N  language/N  language/N

Penalties: u(s, i, t) = 0 for all s, i, t

Iteration 1:
u(1, 5, N) −1    u(1, 5, A) 1
u(3, 1, N) −1    u(3, 1, A) 1

Iteration 2:
u(1, 5, N) −1    u(1, 5, A) 1
u(3, 1, N) −1    u(3, 1, A) 1
u(2, 3, N) 1     u(2, 3, A) −1

Key:
F(Y) ⇐ Tagging model           g(z) ⇐ MRF
Y ⇐ Sentence-level tagging     Z ⇐ Inter-sentence constraints
Ys(i, t) = 1 if sentence s has tag t at position i
![Page 71: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/71.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 72: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/72.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 73: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/73.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 74: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/74.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 75: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/75.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 76: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/76.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 77: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/77.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 78: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/78.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
![Page 79: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/79.jpg)
Tagging
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
A
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
English
N
is
V
my
P
first
A
language
N
He
P
studies
V
language
N
arts
N
now
R
Language
N
makes
V
us
P
human
N
beings
N
MRF
language
A
language
A
language
A
A
language
A
language
A
language
A
A
language
N
language
N
language
N
N
language
N
language
N
language
N
N
language
N
language
N
language
N
N
Penaltiesu(s, i , t) = 0 for all s,i ,t
Iteration 1
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
Iteration 2
u(1, 5,N) -1
u(1, 5,A) 1
u(3, 1,N) -1
u(3, 1,A) 1
u(2, 3,N) 1
u(2, 3,A) -1
Key
F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i
Combined alignment (DeNero and Macherey, 2011)

setup: assume separate models trained for English-to-French and French-to-English alignment

problem: find an alignment that maximizes the score of both models

example:

• HMM models for both directional alignments (assume the correct alignment is one-to-one for simplicity)
English-to-French alignment

define:
• Y is the set of all possible English-to-French alignments
• y ∈ Y is a valid alignment
• f(y) is the score of alignment y

example: HMM alignment (subscripts give the aligned position)

The1 ugly2 dog3 has4 red5 fur6
Le1 laid3 chien2 a4 rouge6 fourrure5
French-to-English alignment

define:
• Z is the set of all possible French-to-English alignments
• z ∈ Z is a valid alignment
• g(z) is the score of alignment z

example: HMM alignment (subscripts give the aligned position)

Le1 chien2 laid3 a4 fourrure5 rouge6
The1 ugly2 dog3 has4 fur6 red5
Identifying word alignments

notation: identify the alignment decisions selected by each model

• y(i, j) = 1 when e-to-f alignment y aligns French word i with English word j
• z(i, j) = 1 when f-to-e alignment z aligns French word i with English word j

example: two HMM alignment models with y(6, 5) = 1 and z(6, 5) = 1
Combined optimization

goal:
arg max_{y∈Y, z∈Z} f(y) + g(z)

such that for all i = 1 . . . n, j = 1 . . . n,
y(i, j) = z(i, j)
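As a hedged sketch of how this combined problem is attacked by subgradient dual decomposition, here is a toy implementation. The sets Y and Z, the scores f and g, and the encoding of alignments as sets of (i, j) pairs are all invented for illustration; real directional decoders would be Viterbi-style dynamic programs over HMMs.

```python
# Toy dual decomposition: two models must agree on indicators y(i,j) = z(i,j).
# Alignments are encoded as frozensets of (i, j) pairs; scores are made up.
Y = [frozenset({(1, 1), (2, 2)}), frozenset({(1, 2), (2, 1)})]
Z = list(Y)
f = {Y[0]: 1.0, Y[1]: 2.0}   # English-to-French scores (hypothetical)
g = {Z[0]: 2.0, Z[1]: 0.0}   # French-to-English scores (hypothetical)
INDICES = [(i, j) for i in (1, 2) for j in (1, 2)]

def decode(scores, structures, u, sign):
    # argmax over structures of score + sign * (sum of penalties on used pairs)
    return max(structures,
               key=lambda s: scores[s] + sign * sum(u[p] for p in s))

u = {p: 0.0 for p in INDICES}
for k in range(1, 50):
    y = decode(f, Y, u, +1.0)    # y* = argmax f(y) + sum u(i,j) y(i,j)
    z = decode(g, Z, u, -1.0)    # z* = argmax g(z) - sum u(i,j) z(i,j)
    if y == z:                   # agreement certifies optimality
        break
    alpha = 1.0 / k              # decreasing update rate
    for p in INDICES:            # u(i,j) -= alpha * (y(i,j) - z(i,j))
        u[p] -= alpha * ((p in y) - (p in z))

print(y == z, f[y] + g[z])
```

On this toy instance the loop reaches agreement on the pair that maximizes f(y) + g(z) subject to the agreement constraint.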
English-to-French

y* = argmax_{y∈Y} ( f(y) + ∑_{i,j} u(i, j) y(i, j) )

French-to-English

z* = argmax_{z∈Z} ( g(z) − ∑_{i,j} u(i, j) z(i, j) )

Penalties u(i, j) = 0 for all i, j

Iteration 1
u(3, 2) = -1
u(2, 2) = 1
u(2, 3) = -1
u(3, 3) = 1

Key
f(y) ⇐ HMM Alignment; g(z) ⇐ HMM Alignment
Y ⇐ English-to-French model; Z ⇐ French-to-English model
y(i, j) = 1 if French word i aligns to English word j
4. Practical issues

aim: overview of practical dual decomposition techniques

• tracking the progress of the algorithm
• choice of update rate α_k
• lazy update of dual solutions
• extracting solutions if the algorithm does not converge
Optimization tracking

at each stage of the algorithm there are several useful values

track:
• y^(k), z^(k) are the current dual solutions
• L(u^(k)) is the current dual value
• y^(k), l(y^(k)) is a potential primal feasible solution, where l maps y^(k) to a consistent structure in Z
• f(y^(k)) + g(l(y^(k))) is the potential primal value
Tracking example

[Plot: current primal value f(y^(k)) + g(l(y^(k))) and current dual value L(u^(k)) over 60 rounds; values range from about -19 to -13]

example run from syntactic machine translation (later in talk)
Optimization progress

useful signals:

• L(u^(k)) − L(u^(k−1)) is the dual change (may be positive)
• min_k L(u^(k)) is the best dual value (tightest upper bound)
• max_k f(y^(k)) + g(l(y^(k))) is the best primal value

the optimal value must lie between the best primal and best dual values
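These signals fall out of a simple log of per-round values. A minimal sketch (both traces are made up for illustration):

```python
# Tracking signals for dual decomposition, from hypothetical logs of
# dual values L(u^(k)) and primal values f(y^(k)) + g(l(y^(k))).
duals = [3.00, 3.00, 3.00, 2.33, 2.17, 2.23]   # made-up dual trace
primals = [0.0, 1.0, 2.0, 2.0, 1.0, 2.0]       # made-up primal trace

best_dual = min(duals)                  # tightest upper bound on the optimum
best_primal = max(primals)              # best feasible lower bound
gap = best_dual - best_primal           # optimum lies in [best_primal, best_dual]
dual_changes = [b - a for a, b in zip(duals, duals[1:])]  # may be positive

assert best_primal <= best_dual         # the sandwich property always holds
print(best_dual, best_primal, gap)
```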
Progress example

[Plots: best primal and best dual values over 60 rounds (range about -19 to -13), and the gap between them shrinking toward 0 (range 0 to 4)]

best primal: max_k f(y^(k)) + g(l(y^(k)))

best dual: min_k L(u^(k))

gap: min_k L(u^(k)) − max_k ( f(y^(k)) + g(l(y^(k))) )
Update rate

choice of α_k has important practical consequences

• α_k too high causes the dual value to fluctuate
• α_k too low means slow progress

[Plot: dual value over 40 rounds for constant rates 0.01, 0.005, and 0.0005]
Update rate

practical: find a rate that is robust to varying inputs

• α_k = c (constant rate) can be very fast, but it is hard to find a constant that works for all problems
• α_k = c / k (decreasing rate) often cuts the rate too aggressively, lowering the value even when making progress
• rate based on dual progress:
  - α_k = c / (t + 1), where t < k is the number of iterations where the dual value increased
  - robust in practice; reduces the rate when the dual value is fluctuating
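The three schedules can be sketched as follows; c is a problem-dependent constant you would tune, and the dual trace used to drive the progress-based rate is invented:

```python
# Hedged sketch of the update-rate schedules above.
def constant_rate(c, k, t):
    return c                       # fast but fragile across inputs

def decreasing_rate(c, k, t):
    return c / k                   # often decays too aggressively

def progress_rate(c, k, t):
    return c / (t + 1)             # back off only when the dual fluctuates

# t counts iterations where the dual value increased (made-up trace)
duals = [3.0, 3.0, 2.3, 2.5, 2.2, 2.4]
t = 0
steps = []
for k in range(1, len(duals)):
    if duals[k] > duals[k - 1]:    # dual went up: we are overshooting
        t += 1
    steps.append(progress_rate(0.01, k, t))

print(steps)
```

Note that the progress-based rate only decays on the rounds where the dual actually increased, so steady progress keeps the step size large.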
Lazy decoding

idea: don't recompute y^(k) or z^(k) from scratch at each iteration

lazy decoding: if the subgradient update to u^(k) is sparse, then y^(k) may be very easy to compute from y^(k−1)

use:
• helpful if y or z factors naturally into independent components
• can be important for fast decompositions
Lazy decoding example

recall the corpus-level tagging example

at this iteration, only sentence 2 receives a weight update

with lazy decoding:
Y_1^(k) ← Y_1^(k−1)
Y_3^(k) ← Y_3^(k−1)
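An illustrative sketch of this caching pattern: keep each component's last solution and re-decode a component only when its penalties changed. The `decode_one` function and the component ids are hypothetical stand-ins for per-sentence Viterbi decoding.

```python
# Lazy decoding sketch: reuse Y_s^(k) = Y_s^(k-1) for unchanged components.
def lazy_decode(components, decode_one, changed, cache):
    for s in components:
        if s in changed or s not in cache:
            cache[s] = decode_one(s)   # full decode only when needed
        # otherwise the previous solution is reused from the cache
    return cache

calls = {"n": 0}
def decode_one(s):
    calls["n"] += 1                    # count expensive decodes
    return ("tagging for sentence", s)

cache = {}
lazy_decode([1, 2, 3], decode_one, changed={1, 2, 3}, cache=cache)
lazy_decode([1, 2, 3], decode_one, changed={2}, cache=cache)
print(calls["n"])   # 4: three initial decodes, then only sentence 2
```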
Lazy decoding results

lazy decoding is critical for the efficiency of some applications

[Plot: % of head automata recomputed (0–30%) over 5000 iterations of dual decomposition, for the grandparent+sibling (g+s) and sibling (sib) decompositions]

recomputation statistics for non-projective dependency parsing
Approximate solution

upon agreement the solution is exact, but agreement may not occur

otherwise, there is an easy way to find an approximate solution

choose: the structure y^(k′) where

k′ = arg max_k f(y^(k)) + g(l(y^(k)))

is the iteration with the best primal score

guarantee: the solution y^(k′) is non-optimal by at most

( min_k L(u^(k)) ) − ( f(y^(k′)) + g(l(y^(k′))) )

there are other methods to estimate solutions, for instance by averaging solutions (see Nedic and Ozdaglar (2009))
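This selection rule and its suboptimality bound are a few lines of bookkeeping. A sketch over hypothetical traces (primals[k] stands for f(y^(k)) + g(l(y^(k))); both traces are made up):

```python
# Pick the iterate with the best primal score and bound its suboptimality
# by the best dual value, as described above.
primals = [-20.0, -17.5, -16.0, -16.4, -15.9]
duals = [-13.0, -14.2, -15.0, -15.2, -15.4]    # L(u^(k))

k_best = max(range(len(primals)), key=lambda k: primals[k])
bound = min(duals) - primals[k_best]   # y^(k') is within `bound` of optimal

print(k_best, bound)
```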
Choosing best solution

[Plot: current primal, current dual, and best primal values over 70 rounds, ranging from about -30 to 0]

non-exact example from syntactic translation

the best approximate primal solution occurs at iteration 63
Early stopping results

early stopping results for constituency and dependency parsing

[Plot: percentage (50–100) versus maximum number of dual decomposition iterations (0–50); curves: f-score, % certificates, % match K=50]
Early stopping results

early stopping results for non-projective dependency parsing

[Plot: percentage (50–100) versus maximum number of dual decomposition iterations (0–1000); curves: % validation UAS, % certificates, % match K=5000]
Tightening

instead of using an approximate solution, we can tighten the algorithm

this may help find an exact solution, at the cost of added complexity

this technique is the focus of the next section
5. Linear programming

aim: explore the connections between dual decomposition and linear programming

• basic optimization over the simplex
• formal properties of linear programming
• full example with fractional optimal solutions
• tightening linear program relaxations
Simplex

define:
• ∆_y ⊂ R^|Y| is the simplex over Y, where α ∈ ∆_y implies α_y ≥ 0 and ∑_y α_y = 1
• α is a distribution over Y
• ∆_z is the simplex over Z
• δ_y : Y → ∆_y maps elements to the simplex

example: Y = {y1, y2, y3}

vertices:
• δ_y(y1) = (1, 0, 0)
• δ_y(y2) = (0, 1, 0)
• δ_y(y3) = (0, 0, 1)

[Diagram: triangle ∆_y with vertices δ_y(y1), δ_y(y2), δ_y(y3)]
Theorem 1. Simplex linear program

optimize over the simplex ∆_y instead of the discrete set Y

goal: optimize the linear program

max_{α∈∆_y} ∑_y α_y f(y)

theorem:
max_{y∈Y} f(y) = max_{α∈∆_y} ∑_y α_y f(y)

proof: the points in Y correspond to the extreme points of the simplex, {δ_y(y) : y ∈ Y}, and a linear program has an optimum at an extreme point

note: this is finding the highest-scoring distribution α over Y; the proof shows that the best distribution chooses a single parse
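The theorem is easy to check numerically: no distribution over Y can beat the best single structure, and a vertex δ_y(y) attains the max exactly. A toy sanity check (the scores f are invented, not the tutorial's parser):

```python
# Numerical check of Theorem 1 on a three-element Y with made-up scores.
import random

f = [1.0, 4.0, 2.5]                        # f(y) for Y = {y1, y2, y3}
best = max(f)                              # max_{y in Y} f(y)

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in f]
    alpha = [x / sum(w) for x in w]        # random point in the simplex
    # any mixture scores at most the best single structure
    assert sum(a * v for a, v in zip(alpha, f)) <= best + 1e-9

vertex = [0.0, 1.0, 0.0]                   # delta_y(y2), an extreme point
assert sum(a * v for a, v in zip(vertex, f)) == best
print("ok")
```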
Combined linear program

optimize over the simplices ∆_y and ∆_z instead of the discrete sets Y and Z

goal: optimize the linear program

max_{α∈∆_y, β∈∆_z} ∑_y α_y f(y) + ∑_z β_z g(z)

such that for all i, t,

∑_y α_y y(i, t) = ∑_z β_z z(i, t)

note: the two distributions must match in the expectation of POS tags

the best distributions α*, β* are possibly no longer a single parse tree or tag sequence
Lagrangian

Lagrangian:

M(u, α, β) = ∑_y α_y f(y) + ∑_z β_z g(z) + ∑_{i,t} u(i, t) ( ∑_y α_y y(i, t) − ∑_z β_z z(i, t) )

           = ( ∑_y α_y f(y) + ∑_{i,t} u(i, t) ∑_y α_y y(i, t) ) + ( ∑_z β_z g(z) − ∑_{i,t} u(i, t) ∑_z β_z z(i, t) )

Lagrangian dual:

M(u) = max_{α∈∆_y, β∈∆_z} M(u, α, β)
Theorem 2. Strong duality

define:
• α*, β* is the optimal assignment to α, β in the linear program

theorem:
min_u M(u) = ∑_y α*_y f(y) + ∑_z β*_z g(z)

proof: by linear programming duality
Theorem 3. Dual relationship

theorem: for any value of u,

M(u) = L(u)

note: solving the original Lagrangian dual also solves the dual of the linear program
Theorem 3. Dual relationship (proof sketch)

focus on the Y term in the Lagrangian:

L(u) = max_{y∈Y} ( f(y) + ∑_{i,t} u(i, t) y(i, t) ) + . . .

M(u) = max_{α∈∆_y} ( ∑_y α_y f(y) + ∑_{i,t} u(i, t) ∑_y α_y y(i, t) ) + . . .

by Theorem 1, optimization over Y and over ∆_y have the same max

a similar argument for Z gives L(u) = M(u)
Summary

f(y) + g(z)                          original primal objective
L(u)                                 original dual
∑_y α_y f(y) + ∑_z β_z g(z)          LP primal objective
M(u)                                 LP dual

relationship between the LP dual, original dual, and LP primal objective:

min_u M(u) = min_u L(u) = ∑_y α*_y f(y) + ∑_z β*_z g(z)
Primal relationship

define:
• Q ⊆ ∆_y × ∆_z corresponds to feasible solutions of the original problem:
  Q = {(δ_y(y), δ_z(z)) : y ∈ Y, z ∈ Z, y(i, t) = z(i, t) for all (i, t)}
• Q′ ⊆ ∆_y × ∆_z is the set of feasible solutions to the LP:
  Q′ = {(α, β) : α ∈ ∆_y, β ∈ ∆_z, ∑_y α_y y(i, t) = ∑_z β_z z(i, t) for all (i, t)}
• Q ⊆ Q′

solutions:
max_{q∈Q} h(q) ≤ max_{q∈Q′} h(q) for any h
Concrete example

• Y = {y1, y2, y3}
• Z = {z1, z2, z3}
• ∆_y ⊂ R³, ∆_z ⊂ R³

[Diagrams over the sentence "He is":]
Y: y1 = He/a is/a, y2 = He/b is/b, y3 = He/c is/c
Z: z1 = He/a is/b, z2 = He/b is/a, z3 = He/c is/c
Simple solution

choose:
• α(1) = (0, 0, 1) ∈ ∆_y is the representation of y3
• β(1) = (0, 0, 1) ∈ ∆_z is the representation of z3

confirm:
∑_y α(1)_y y(i, t) = ∑_z β(1)_z z(i, t)

α(1) and β(1) satisfy the agreement constraint
Fractional solution

choose:
• α(2) = (0.5, 0.5, 0) ∈ ∆_y is a combination of y1 and y2
• β(2) = (0.5, 0.5, 0) ∈ ∆_z is a combination of z1 and z2

confirm:
∑_y α(2)_y y(i, t) = ∑_z β(2)_z z(i, t)

(under both mixtures, "He" and "is" each get tag a with probability 0.5 and tag b with probability 0.5)

α(2) and β(2) satisfy the agreement constraint, but are not integral
Optimal solution

weights:
• the choice of f and g determines the optimal solution
• if (f, g) favors (α(2), β(2)), the optimal solution is fractional

example: f = [1 1 2] and g = [1 1 −2]
• f · α(1) + g · β(1) = 0 vs. f · α(2) + g · β(2) = 2
• α(2), β(2) is optimal, even though it is fractional

summary: dual and LP primal optimal:

min_u M(u) = min_u L(u) = ∑_y α(2)_y f(y) + ∑_z β(2)_z g(z) = 2

original primal optimal:

f(y*) + g(z*) = 0
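The arithmetic above is quick to verify: with these weights the fractional pair scores 2 while the integral agreeing pair (y3, z3) scores 0, so the LP optimum strictly exceeds the original primal optimum.

```python
# Re-checking the scores with f = [1, 1, 2] and g = [1, 1, -2].
f = [1.0, 1.0, 2.0]
g = [1.0, 1.0, -2.0]

def score(alpha, beta):
    # LP objective: sum_y alpha_y f(y) + sum_z beta_z g(z)
    return (sum(a * v for a, v in zip(alpha, f))
            + sum(b * v for b, v in zip(beta, g)))

alpha1, beta1 = [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]   # integral: y3, z3
alpha2, beta2 = [0.5, 0.5, 0.0], [0.5, 0.5, 0.0]   # fractional

print(score(alpha1, beta1))   # 0.0
print(score(alpha2, beta2))   # 2.0
```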
Subgradient rounds on this example (animation consolidated):

round   y^(k)   z^(k)
1       y3      z2
2       y2      z1
3       y1      z1
4       y1      z1
5       y2      z2
6       y1      z1
7       y2      z2
8       y1      z1
9       y2      z2

dual values:
round 1: y 2.00, z 1.00, L(u^(1)) = 3.00
round 2: y 2.00, z 1.00, L(u^(2)) = 3.00
round 3: y 2.50, z 0.50, L(u^(3)) = 3.00
round 4: y 2.17, z 0.17, L(u^(4)) = 2.33
round 5: y 2.08, z 0.08, L(u^(5)) = 2.17
round 6: y 2.12, z 0.12, L(u^(6)) = 2.23

the dual solutions oscillate between (y1, z1) and (y2, z2) without ever agreeing, while the dual value trends toward the LP optimum of 2
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 9
dual solutions:
x
b
He
b
is
y2
b
He
a
is
z2
0 1 2 3 4 5 6 7 8 9 100
1
2
3
4
5
Round
dual values:y (7) 2.05
z(7) 0.05
L(u(7)) 2.10
previous solutions:y3 z2y2 z1y1 z1y1 z1y2 z2y1 z1y2 z2
y1 z1y2 z2
![Page 133: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/133.jpg)
round 1
dual solutions:
x
c
He
c
is
y3
b
He
a
is
z2
round 2
dual solutions:
x
b
He
b
is
y2
a
He
b
is
z1
round 3
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 4
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 5
dual solutions:
x
b
He
b
is
y2
b
He
a
is
z2
round 6
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 7
dual solutions:
x
b
He
b
is
y2
b
He
a
is
z2
round 8
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 9
dual solutions:
x
b
He
b
is
y2
b
He
a
is
z2
0 1 2 3 4 5 6 7 8 9 100
1
2
3
4
5
Round
dual values:y (8) 2.09
z(8) 0.09
L(u(8)) 2.19
previous solutions:y3 z2y2 z1y1 z1y1 z1y2 z2y1 z1y2 z2y1 z1
y2 z2
![Page 134: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/134.jpg)
round 1
dual solutions:
x
c
He
c
is
y3
b
He
a
is
z2
round 2
dual solutions:
x
b
He
b
is
y2
a
He
b
is
z1
round 3
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 4
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 5
dual solutions:
x
b
He
b
is
y2
b
He
a
is
z2
round 6
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 7
dual solutions:
x
b
He
b
is
y2
b
He
a
is
z2
round 8
dual solutions:
x
a
He
a
is
y1
a
He
b
is
z1
round 9
dual solutions:
x
b
He
b
is
y2
b
He
a
is
z2
0 1 2 3 4 5 6 7 8 9 100
1
2
3
4
5
Round
dual values:y (9) 2.03
z(9) 0.03
L(u(9)) 2.06
previous solutions:y3 z2y2 z1y1 z1y1 z1y2 z2y1 z1y2 z2y1 z1y2 z2
![Page 135: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/135.jpg)
Tightening (Sherali and Adams, 1994; Sontag et al., 2008)
modify:

• extend Y, Z to identify bigrams of part-of-speech tags
• y(i, t1, t2) = 1 ↔ y(i, t1) = 1 and y(i+1, t2) = 1
• z(i, t1, t2) = 1 ↔ z(i, t1) = 1 and z(i+1, t2) = 1

all bigram constraints: valid to add for all i, t1, t2 ∈ T

∑_y α_y y(i, t1, t2) = ∑_z β_z z(i, t1, t2)
however this would make decoding expensive
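The bigram indicators above are easy to sketch directly. The helper below is illustrative (not from the tutorial) and uses 1-based positions to match the notation y(1, a, b):

```python
# Sketch: map a tag sequence to the set of bigram indicators that
# equal 1, i.e. y(i, t1, t2) = 1 iff y(i, t1) = 1 and y(i+1, t2) = 1.
# Positions are 1-based, matching the slides' y(1, a, b) notation.

def bigram_indicators(tagging):
    """Return the set of (i, t1, t2) triples whose indicator is 1."""
    return {(i, t1, t2)
            for i, (t1, t2) in enumerate(zip(tagging, tagging[1:]),
                                         start=1)}

# e.g. the tagging (a, b) for "He is" has the single bigram (1, a, b)
print(bigram_indicators(("a", "b")))  # → {(1, 'a', 'b')}
```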
Iterative tightening

single bigram constraint: cheaper to implement

∑_y α_y y(1, a, b) = ∑_z β_z z(1, a, b)
the solution α(1), β(1) trivially passes this constraint, while α(2), β(2) violates it
[Figure: the candidate taggings for "He is": Y = {y1: (a, a), y2: (b, b), y3: (c, c)}, Z = {z1: (a, b), z2: (b, a), z3: (c, c)}.]
Dual decomposition with tightening
tightened decomposition includes an additional Lagrange multiplier
y(u,v) = arg max_{y∈Y} f(y) + ∑_{i,t} u(i, t) y(i, t) + v(1, a, b) y(1, a, b)

z(u,v) = arg max_{z∈Z} g(z) − ∑_{i,t} u(i, t) z(i, t) − v(1, a, b) z(1, a, b)
in general, this term can make the decoding problem more difficult
example:
• for small examples, these penalties are easy to compute
• for CFG parsing, need to include extra states that maintain tag bigrams (still faster than full intersection)
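A minimal sketch of the penalized y-subproblem objective above, with hypothetical scores and multipliers; `tightened_score` is an illustrative name, not from the tutorial:

```python
# Sketch: the y-subproblem objective with the single extra multiplier
# v(1, a, b): f(y) + sum_{i,t} u(i,t) y(i,t) + v * y(1, a, b).
# Scores, multipliers, and tag names are all hypothetical.

def tightened_score(tagging, base_score, u, v):
    """Penalized score for one candidate tagging (1-based positions)."""
    s = base_score + sum(u.get((i, t), 0.0)
                         for i, t in enumerate(tagging, start=1))
    if tagging[:2] == ("a", "b"):   # the bigram indicator y(1, a, b)
        s += v
    return s

u = {(1, "a"): 0.5}                  # made-up multiplier values
print(tightened_score(("a", "b"), 1.0, u, 2.0))  # → 3.5
```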
[Figure: subgradient rounds 7–16 with the tightened constraint; solutions cycle through (y3, z2), (y2, z3), (y1, z2), (y3, z1) before finally agreeing at (y3, z3) in round 16. Dual values per round:
round 7: y 2.00, z 1.00, L(u) 3.00
round 8: y 3.00, z 2.00, L(u) 5.00
round 9: y 3.00, z −1.00, L(u) 2.00
round 10: y 2.00, z 1.00, L(u) 3.00
round 11: y 3.00, z 2.00, L(u) 5.00
round 12: y 3.00, z −1.00, L(u) 2.00
round 13: y 2.00, z −1.00, L(u) 1.00
round 14: y 3.00, z 2.00, L(u) 5.00
round 15: y 3.00, z −1.00, L(u) 2.00
round 16: y 2.00, z −2.00, L(u) 0.00]
6. Advanced examples
aim: demonstrate some different relaxation techniques
• higher-order non-projective dependency parsing
• syntactic machine translation
Higher-order non-projective dependency parsing
setup: given a model for higher-order non-projective dependency parsing (sibling features)

problem: find the non-projective dependency parse that maximizes the score of this model

difficulty:

• model is NP-hard to decode
• complexity of the model comes from enforcing combinatorial constraints

strategy: design a decomposition that separates the combinatorial constraints from direct implementation of the scoring function
Non-Projective Dependency Parsing
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
Important problem in many languages.
Problem is NP-Hard for all but the simplest models.
Dual Decomposition
A classical technique for constructing decoding algorithms.
Solve complicated models
y∗ = arg max_y f(y)
by decomposing into smaller problems.
Upshot: Can utilize a toolbox of combinatorial algorithms.
• Dynamic programming
• Minimum spanning tree
• Shortest path
• Min-Cut
• ...
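The generic recipe can be sketched on the running tagging example. All scores below are hypothetical and chosen so that the two subproblems agree quickly; when they do agree, the shared solution is provably optimal for the combined model:

```python
# Toy subgradient dual decomposition in the spirit of the slides: two
# models score complete taggings of "He is"; Lagrange multipliers
# u(i, t) push the two argmax solutions toward agreement. All scores
# here are hypothetical.

SENT = ["He", "is"]
TAGS = ["a", "b", "c"]

# Candidate solutions: each is a full tagging (one tag per word).
Y = {"y1": ("a", "a"), "y2": ("b", "b"), "y3": ("c", "c")}
Z = {"z1": ("a", "b"), "z2": ("b", "a"), "z3": ("c", "c")}

f = {"y1": 1.0, "y2": 1.0, "y3": 3.0}   # first model's scores (made up)
g = {"z1": 1.0, "z2": 1.0, "z3": 2.0}   # second model's scores (made up)


def decode(cands, scores, u, sign):
    """argmax over candidates of score + sign * sum_i u[i, tag_i]."""
    def total(name):
        return scores[name] + sign * sum(
            u[i, t] for i, t in enumerate(cands[name]))
    return max(cands, key=total)


def dual_decompose(rounds=50, step=1.0):
    u = {(i, t): 0.0 for i in range(len(SENT)) for t in TAGS}
    for k in range(1, rounds + 1):
        y = decode(Y, f, u, +1.0)   # maximize f(y) + sum u(i,t) y(i,t)
        z = decode(Z, g, u, -1.0)   # maximize g(z) - sum u(i,t) z(i,t)
        if Y[y] == Z[z]:            # agreement: certificate of optimality
            return y, z, k
        for i, t in u:              # u <- u - (step/k) * (y(i,t) - z(i,t))
            u[i, t] -= (step / k) * ((Y[y][i] == t) - (Z[z][i] == t))
    return y, z, rounds

print(dual_decompose())  # → ('y3', 'z3', 1)
```

With these scores both subproblems pick the tagging (c, c) in the first round, so the loop terminates immediately; less agreeable scores produce the oscillation shown in the earlier figures.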
Non-Projective Dependency Parsing
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
• Starts at the root symbol *
• Each word has exactly one parent word
• Produces a tree structure (no cycles)
• Dependencies can cross
Arc-Factored
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
f(y) = score(head = ∗0, mod = saw2) + score(saw2, John1)
+ score(saw2, movie4) + score(saw2, today5)
+ score(movie4, a3) + ...
e.g. score(∗0, saw2) = log p(saw2|∗0) (generative model)
or score(∗0, saw2) = w · φ(saw2, ∗0) (CRF/perceptron model)
y∗ = arg max_y f(y) ⇐ Minimum Spanning Tree Algorithm
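Arc-factored scoring itself is simple to implement; the hard part is the argmax over trees, which the slide notes reduces to a (directed) minimum spanning tree computation. A minimal sketch with made-up arc scores:

```python
# Sketch of arc-factored scoring: a parse y is a set of
# (head, modifier) arcs and f(y) is the sum of per-arc scores.
# The scores below are made-up numbers for the example sentence.

ARC_SCORE = {("*", "saw"): 2.0, ("saw", "John"): 1.5,
             ("saw", "movie"): 1.0, ("movie", "a"): 0.5}

def arc_factored_score(arcs):
    """f(y) = sum over arcs of score(head, mod); unknown arcs score 0."""
    return sum(ARC_SCORE.get(arc, 0.0) for arc in arcs)

print(arc_factored_score([("*", "saw"), ("saw", "John")]))  # → 3.5
```

Because the score decomposes over individual arcs, the best tree can be found with a directed spanning tree algorithm (e.g. Chu-Liu-Edmonds) rather than by enumerating parses.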
![Page 162: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/162.jpg)
Sibling Models
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
f(y) = score(head = ∗0, prev = NULL, mod = saw2)
+ score(saw2, NULL, John1) + score(saw2, NULL, movie4)
+ score(saw2, movie4, today5) + ...
e.g. score(saw2,movie4, today5) = log p(today5|saw2,movie4)
or score(saw2,movie4, today5) = w · φ(saw2,movie4, today5)
y∗ = arg max_y f(y) ⇐ NP-Hard
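A sketch of how a sibling model scores the modifiers on one side of a head, walking outward with the previous modifier as context (NULL for the first one); `sibling_score` is an illustrative helper, not tutorial code:

```python
# Sketch of sibling-model scoring for one head: each modifier is
# scored together with the head and the previous modifier on the
# same side (None plays the role of NULL for the first modifier).

def sibling_score(head, mods, score):
    """Sum score(head, prev, mod) over the ordered modifiers of head."""
    total, prev = 0.0, None
    for mod in mods:
        total += score(head, prev, mod)
        prev = mod
    return total
```

The score function itself can be a log-probability or a feature dot product, as on the slide; the chaining through `prev` is what couples adjacent modifiers and makes joint decoding over all heads hard.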
Thought Experiment: Individual Decoding
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
score(saw2, NULL, John1) + score(saw2, NULL, movie4) + score(saw2, movie4, today5)

score(saw2, NULL, John1) + score(saw2, NULL, that6)

score(saw2, NULL, a3) + score(saw2, a3, he7)

2^(n−1) possibilities
Under Sibling Model, can solve for each word with Viterbi decoding.
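The thought experiment can be sketched by brute force: enumerate the modifier subsets for one head and score each with the sibling model. This is the 2^(n−1) enumeration the slide mentions; the Viterbi-style dynamic program over the modifier sequence avoids it. Names and scores below are illustrative:

```python
# Sketch of "individual decoding" for a single head: pick the subset
# of candidate modifiers (kept in left-to-right order) that maximizes
# the sibling-model score. Brute force over all 2^(n-1) subsets for
# clarity; a Viterbi-style dynamic program does this in poly time.

from itertools import combinations

def best_modifier_set(head, words, score):
    best, best_val = (), float("-inf")
    for r in range(len(words) + 1):
        for subset in combinations(words, r):  # preserves word order
            prev, val = None, 0.0
            for mod in subset:                 # None plays NULL's role
                val += score(head, prev, mod)
                prev = mod
            if val > best_val:
                best, best_val = subset, val
    return best, best_val
```

With made-up scores that reward "John" and "movie" but penalize "today", the brute-force search returns the subset ("John", "movie"), mirroring the slide's comparison of candidate modifier sets for saw2.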
Thought Experiment Continued

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

Idea: Do individual decoding for each head word using dynamic programming.

If we're lucky, we'll end up with a valid final tree.

But we might violate some constraints.
Dual Decomposition Idea

                 No Constraints         Tree Constraints
Arc-Factored                            Minimum Spanning Tree
Sibling Model    Individual Decoding    Dual Decomposition
Dual Decomposition Structure

Goal:   y* = argmax_{y ∈ Y} f(y)

Rewrite as:   argmax_{z ∈ Z, y ∈ Y} f(z) + g(y)   such that z = y

f: Sibling model        g: Arc-Factored model
Z: All possible structures        Y: Valid trees
z = y: agreement constraint
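The rewrite above is the standard Lagrangian-relaxation step; spelling it out (with multipliers u(i, j) for the agreement constraints, matching the penalty signs used on the following slides):

```latex
% Introduce a multiplier u(i,j) for each constraint z(i,j) = y(i,j):
L(u, y, z) = f(z) + g(y) + \sum_{i,j} u(i,j)\,\bigl(z(i,j) - y(i,j)\bigr)

% Maximizing over y and z, the dual objective separates into the two
% tractable subproblems:
L(u) = \max_{z \in \mathcal{Z}} \Bigl( f(z) + \sum_{i,j} u(i,j)\, z(i,j) \Bigr)
     + \max_{y \in \mathcal{Y}} \Bigl( g(y) - \sum_{i,j} u(i,j)\, y(i,j) \Bigr)
```

Minimizing L(u) over u by subgradient steps gives the algorithm sketched next.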
Algorithm Sketch

Set penalty weights u(i, j) equal to 0 for all edges.

For k = 1 to K:
  z(k) ← Decode (f(z) + penalty) by Individual Decoding
  y(k) ← Decode (g(y) − penalty) by Minimum Spanning Tree
  If y(k)(i, j) = z(k)(i, j) for all i, j, Return (y(k), z(k))
  Else Update penalty weights based on y(k)(i, j) − z(k)(i, j)
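The sketch above can be written out as a small subgradient loop. This is an illustration, not the authors' implementation: `individual_decode` and `mst_decode` stand in for the two oracle subproblem solvers, structures are represented as sets of (i, j) arcs, and the decaying step size 1/k is one common choice.

```python
def dual_decompose(individual_decode, mst_decode, arcs, K=100, step=1.0):
    # u(i, j): one Lagrange multiplier ("penalty") per candidate arc
    u = {a: 0.0 for a in arcs}
    for k in range(1, K + 1):
        z = individual_decode(u)   # argmax_z f(z) + sum_{i,j} u(i,j) z(i,j)
        y = mst_decode(u)          # argmax_y g(y) - sum_{i,j} u(i,j) y(i,j)
        if y == z:                 # agreement certificate: provably optimal
            return y, z
        rate = step / k            # decaying step size
        for a in arcs:             # u <- u + rate * (y(i,j) - z(i,j))
            u[a] += rate * ((a in y) - (a in z))
    return y, z                    # no convergence: approximate answer
```

Any pair of arc-set decoders can be plugged in; in the parsing application they are the sibling-model Viterbi decoder and a maximum spanning tree algorithm.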
Individual Decoding

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

z* = argmax_{z ∈ Z} ( f(z) + Σ_{i,j} u(i, j) z(i, j) )

Minimum Spanning Tree

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

y* = argmax_{y ∈ Y} ( g(y) − Σ_{i,j} u(i, j) y(i, j) )

Penalties: u(i, j) = 0 for all i, j

Iteration 1:  u(8, 1) = −1,  u(4, 6) = −1,  u(2, 6) = 1,  u(8, 7) = 1

Iteration 2:  u(8, 1) = −1,  u(4, 6) = −2,  u(2, 6) = 2,  u(8, 7) = 1

Converged:  y* = argmax_{y ∈ Y} f(y) + g(y)

Key:
f(z) ⇐ Sibling Model        g(y) ⇐ Arc-Factored Model
Z ⇐ No Constraints          Y ⇐ Tree Constraints
y(i, j) = 1 if y contains dependency (i, j)
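The penalty table above is consistent with a simple disagreement pattern between the two subproblems. The sketch below is hypothetical: the slide shows only the resulting u values, so the disagreement sets are inferred from them, and a constant step size of 1 is assumed. Only disagreeing arcs matter, since arcs on which y and z agree contribute zero to the update.

```python
def update_penalties(u, y_arcs, z_arcs, rate):
    """One subgradient step: u(i,j) += rate * (y(i,j) - z(i,j))."""
    for a in y_arcs - z_arcs:      # arcs only the tree decoder chose
        u[a] = u.get(a, 0.0) + rate
    for a in z_arcs - y_arcs:      # arcs only individual decoding chose
        u[a] = u.get(a, 0.0) - rate
    return u

u = {}
# iteration 1: z proposes (8,1), (4,6) where y does not; y proposes (2,6), (8,7)
u = update_penalties(u, y_arcs={(2, 6), (8, 7)}, z_arcs={(8, 1), (4, 6)}, rate=1.0)
# iteration 2: the subproblems still disagree on (2,6) and (4,6)
u = update_penalties(u, y_arcs={(2, 6)}, z_arcs={(4, 6)}, rate=1.0)
```

After these two steps u matches the "Iteration 2" column of the slide.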
Guarantees

Theorem: If at any iteration y(k) = z(k), then (y(k), z(k)) is the global optimum.

In experiments, we find the global optimum on 98% of examples.

If we do not converge to a match, we can still return an approximate solution (more in the paper).
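The certificate behind the theorem is the standard weak-duality argument, sketched here from the Lagrangian definitions used for the subproblems:

```latex
% For any u, the dual upper-bounds the constrained optimum, since the
% optimal tree y^* is feasible for both subproblems and its u-terms cancel:
L(u) \;\ge\; \max_{y \in \mathcal{Y}} \bigl( f(y) + g(y) \bigr)

% If the subproblems agree at iteration k, y^{(k)} = z^{(k)}, then
L(u^{(k)}) = f(z^{(k)}) + g(y^{(k)}) = f(y^{(k)}) + g(y^{(k)})
% so the upper bound is attained and (y^{(k)}, z^{(k)}) is globally optimal.
```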
Extensions

• Grandparent Models

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

f(y) = ... + score(gp = *0, head = saw2, prev = movie4, mod = today5)

• Head Automata (Eisner, 2000)

A generalization of sibling models: allow arbitrary automata as local scoring functions.
Experiments

Properties:
• Exactness
• Parsing Speed
• Parsing Accuracy
• Comparison to Individual Decoding
• Comparison to LP/ILP

Training:
• Averaged Perceptron (more details in paper)

Experiments on:
• CoNLL Datasets
• English Penn Treebank
• Czech Dependency Treebank
How often do we exactly solve the problem?

[Bar chart over Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur; y-axis 90–100%]

• Percentage of examples where dual decomposition finds an exact solution.
Parsing Speed

[Bar charts over Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur: sibling model (y-axis 0–50) and grandparent model (y-axis 0–25)]

• Number of sentences parsed per second
• Comparable to dynamic programming for projective parsing
Accuracy

        Arc-Factored   Prev Best   Grandparent
Dan     89.7           91.5        91.8
Dut     82.3           85.6        85.8
Por     90.7           92.1        93.0
Slo     82.4           85.6        86.2
Swe     88.9           90.6        91.4
Tur     75.7           76.4        77.6
Eng     90.1           —           92.5
Cze     84.4           —           87.3

Prev Best: best reported results for the CoNLL-X data set, including

• Approximate search (McDonald and Pereira, 2006)
• Loopy belief propagation (Smith and Eisner, 2008)
• (Integer) Linear Programming (Martins et al., 2009)
Comparison to Subproblems

[Bar chart for English: Individual, MST, and Dual; y-axis 88–93]

• F1 for dependency accuracy
Comparison to LP/ILP

Martins et al. (2009) propose two representations of non-projective dependency parsing as linear programming relaxations, as well as an exact ILP:

• LP (1)
• LP (2)
• ILP

These use an LP/ILP solver for decoding.

We compare:
• Accuracy
• Exactness
• Speed

Both the LP/ILP and dual decomposition methods use the same model, features, and weights w.
Comparison to LP/ILP: Accuracy
[Bar chart: dependency accuracy for LP(1), LP(2), ILP, and Dual]

• All decoding methods have comparable accuracy
![Page 221: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/221.jpg)
Comparison to LP/ILP: Exactness and Speed
[Bar charts: percentage of sentences with an exact solution, and sentences decoded per second, for LP(1), LP(2), ILP, and Dual]
![Page 222: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/222.jpg)
Syntactic translation decoding
setup: assume a trained model for syntactic machine translation
problem: find the best derivation that maximizes the score of this model
difficulty:
• need to incorporate language model in decoding
• empirically, the relaxation is often not tight, so dual decomposition does not always converge
strategy:
• use a different relaxation to handle language model
• incrementally add constraints to find exact solution
![Page 223: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/223.jpg)
Syntactic Translation
Problem:
Decoding synchronous grammar for machine translation
Example:
<s> abarks le dug </s>
<s> the dog barks loudly </s>
Goal:
y∗ = arg max_y f (y)
where y is a parse derivation in a synchronous grammar
![Page 224: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/224.jpg)
Hiero Example
Consider the input sentence
<s> abarks le dug </s>
And the synchronous grammar
S → <s> X </s>, <s> X </s>
X → abarks X, X barks loudly
X → abarks X, barks X
X → abarks X, barks X loudly
X → le dug, the dog
X → le dug, a cat
![Page 225: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/225.jpg)
Hiero Example
Apply synchronous rules to map this sentence
[Source-side derivation: S → <s> [abarks [le dug]] </s>; target-side derivation: S → <s> [[the dog] barks loudly] </s>]
Many possible mappings:
<s> the dog barks loudly </s>
<s> a cat barks loudly </s>
<s> barks the dog </s>
<s> barks a cat </s>
<s> barks the dog loudly </s>
<s> barks a cat loudly </s>
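The six mappings above can be reproduced by expanding the target sides of the toy grammar. A minimal sketch: the split into `X1` (an X over "abarks le dug") and `X2` (an X over "le dug") is an illustrative encoding of the source-side spans, not notation from the tutorial.

```python
from itertools import product

# Target sides of the synchronous rules above; nonterminal names X1/X2
# encode which source span each X covers (illustrative assumption).
TARGET = {
    "S":  ["<s> X1 </s>"],
    "X1": ["X2 barks loudly", "barks X2", "barks X2 loudly"],
    "X2": ["the dog", "a cat"],
}

def expand(symbol):
    """Return every target string derivable from `symbol`."""
    if symbol not in TARGET:               # terminal word
        return [symbol]
    results = []
    for rhs in TARGET[symbol]:
        # expand each token, then take the cross product of the choices
        choices = [expand(tok) for tok in rhs.split()]
        results += [" ".join(p) for p in product(*choices)]
    return results

for sentence in expand("S"):
    print(sentence)
```

Running this prints exactly the six mappings listed above.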
![Page 226: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/226.jpg)
Translation Forest

Rule                  Score
1 → <s> 4 </s>         -1
4 → 5 barks loudly      2
4 → barks 5             0.5
4 → barks 5 loudly      3
5 → the dog            -4
5 → a cat               2.5
Example: a derivation in the translation forest
[derivation: 1 → <s> 4 </s>, 4 → 5 barks loudly, 5 → a cat, yielding <s> a cat barks loudly </s>]
![Page 227: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/227.jpg)
Scoring function
Score: sum of hypergraph derivation and language model scores
[derivation: 1 → <s> 4 </s>, 4 → 5 barks loudly, 5 → a cat]
f (y) = score(5 → a cat)
![Page 228: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/228.jpg)
Scoring function
Score: sum of hypergraph derivation and language model scores
[derivation: 1 → <s> 4 </s>, 4 → 5 barks loudly, 5 → a cat]
f (y) = score(5 → a cat) + score(4 → 5 barks loudly)
![Page 229: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/229.jpg)
Scoring function
Score: sum of hypergraph derivation and language model scores
[derivation: 1 → <s> 4 </s>, 4 → 5 barks loudly, 5 → a cat]
f (y) = score(5 → a cat) + score(4 → 5 barks loudly) + . . .
+ score(<s>, a)
![Page 230: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/230.jpg)
Scoring function
Score: sum of hypergraph derivation and language model scores
[derivation: 1 → <s> 4 </s>, 4 → 5 barks loudly, 5 → a cat]
f (y) = score(5 → a cat) + score(4 → 5 barks loudly) + . . .
+ score(<s>, a) + score(a, cat)
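Concretely, f (y) sums the forest's rule scores over the derivation and bigram scores over the target string. A minimal sketch, using the rule scores from the translation-forest table; the bigram LM scores here are made-up numbers for illustration.

```python
# Rule scores from the translation-forest table above.
RULE_SCORE = {
    "1 → <s> 4 </s>": -1.0,
    "4 → 5 barks loudly": 2.0,
    "5 → a cat": 2.5,
}
# Bigram LM scores: hypothetical values, purely for illustration.
LM_SCORE = {
    ("<s>", "a"): 0.1, ("a", "cat"): 0.3, ("cat", "barks"): 0.2,
    ("barks", "loudly"): 0.4, ("loudly", "</s>"): 0.1,
}

def f(rules, words):
    """Combined score: hypergraph derivation score + bigram LM score."""
    derivation = sum(RULE_SCORE[r] for r in rules)
    bigrams = sum(LM_SCORE[b] for b in zip(words, words[1:]))
    return derivation + bigrams

y_rules = ["5 → a cat", "4 → 5 barks loudly", "1 → <s> 4 </s>"]
y_words = ["<s>", "a", "cat", "barks", "loudly", "</s>"]
print(f(y_rules, y_words))
```

With these numbers the derivation contributes 3.5 and the LM contributes 1.1, for a total of 4.6.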
![Page 231: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/231.jpg)
Exact Dynamic Programming
To maximize the combined model, we need to ensure that bigrams are consistent with the parse tree.
[derivation: 1 → <s> 4 </s>, 4 → 5 barks loudly, 5 → a cat]
![Page 232: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/232.jpg)
Exact Dynamic Programming
To maximize the combined model, we need to ensure that bigrams are consistent with the parse tree.
[derivation: <s> a cat barks loudly </s>, annotated with candidate bigrams]

Original Rules
5 → the dog
5 → a cat

New Rules
⟨<s>, 5, dog⟩ → ⟨<s>, the⟩ ⟨the, dog⟩
⟨barks, 5, dog⟩ → ⟨barks, the⟩ ⟨the, dog⟩
⟨<s>, 5, cat⟩ → ⟨<s>, a⟩ ⟨a, cat⟩
⟨barks, 5, cat⟩ → ⟨barks, a⟩ ⟨a, cat⟩

Each symbol is refined with its language-model context (the word preceding its span and the last word of its span), so the refined grammar tracks every bigram.
![Page 233: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/233.jpg)
Lagrangian Relaxation Algorithm for Syntactic Translation
Outline:
• Algorithm for simplified version of translation
• Full algorithm with certificate of exactness
• Experimental results
![Page 234: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/234.jpg)
Thought experiment: Greedy language model
Choose best bigram for a given word
[figure: the word barks with candidate predecessors <s>, dog, cat]
• score(<s>, barks)
![Page 235: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/235.jpg)
Thought experiment: Greedy language model
Choose best bigram for a given word
• score(<s>, barks)
• score(dog, barks)
![Page 236: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/236.jpg)
Thought experiment: Greedy language model
Choose best bigram for a given word
• score(<s>, barks)
• score(dog, barks)
• score(cat, barks)
![Page 237: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/237.jpg)
Thought experiment: Greedy language model
Choose best bigram for a given word
• score(<s>, barks)
• score(dog, barks)
• score(cat, barks)
Can compute with a simple maximization:
arg max_{w : 〈w, barks〉 ∈ B} score(w, barks)
![Page 238: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/238.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks
![Page 239: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/239.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog
![Page 240: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/240.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog barks
![Page 241: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/241.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog barks <s>
![Page 242: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/242.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog barks <s> the
![Page 243: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/243.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog barks <s> the <s>
![Page 244: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/244.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog barks <s> the <s> a
![Page 245: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/245.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog barks <s> the <s> a
Step 2. Find the best derivation with fixed bigrams
![Page 246: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/246.jpg)
Thought experiment: Full decoding
Step 1. Greedily choose best bigram for each word
</s> barks loudly the dog a cat
barks dog barks <s> the <s> a
Step 2. Find the best derivation with fixed bigrams
[derivation: <s> a cat barks loudly </s>, paired with the greedily chosen bigrams <s> a and dog barks]
![Page 247: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/247.jpg)
Thought Experiment Problem
May produce invalid parse and bigram relationship
[derivation: <s> a cat barks loudly </s>, paired with the greedily chosen bigrams <s> a and dog barks]
Greedy bigram selection may conflict with the parse derivation
![Page 249: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/249.jpg)
Formal objective
Notation: y(w, v) = 1 if the bigram 〈w, v〉 ∈ B is in y
Goal:
arg max_{y ∈ Y} f (y)
such that for all word nodes v, constraint (1) holds
![Page 250: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/250.jpg)
Formal objective
Notation: y(w, v) = 1 if the bigram 〈w, v〉 ∈ B is in y
Goal:
arg max_{y ∈ Y} f (y)
such that for all word nodes v:
yv = ∑_{w : 〈w, v〉 ∈ B} y(w, v)   (1)
![Page 251: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/251.jpg)
Formal objective
Notation: y(w, v) = 1 if the bigram 〈w, v〉 ∈ B is in y
Goal:
arg max_{y ∈ Y} f (y)
such that for all word nodes v:
yv = ∑_{w : 〈w, v〉 ∈ B} y(w, v)   (1)
yv = ∑_{w : 〈v, w〉 ∈ B} y(v, w)   (2)
![Page 253: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/253.jpg)
Formal objective
Notation: y(w, v) = 1 if the bigram 〈w, v〉 ∈ B is in y
Goal:
arg max_{y ∈ Y} f (y)
such that for all word nodes v:
yv = ∑_{w : 〈w, v〉 ∈ B} y(w, v)   (1)
yv = ∑_{w : 〈v, w〉 ∈ B} y(v, w)   (2)

Lagrangian: Relax constraint (2), leave constraint (1)

L(u) = max_{y ∈ Y} f (y) + ∑_v u(v) ( yv − ∑_{w : 〈v, w〉 ∈ B} y(v, w) )

For a given u, L(u) can be computed by our greedy LM algorithm
![Page 254: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/254.jpg)
Algorithm
Set u(1)(v) = 0 for all v ∈ VL
For k = 1 to K:
  y(k) ← arg max_{y ∈ Y} L(u(k), y)
  If y(k)_v = ∑_{w : 〈v, w〉 ∈ B} y(k)(v, w) for all v, Return y(k)
  Else u(k+1)(v) ← u(k)(v) − αk ( y(k)_v − ∑_{w : 〈v, w〉 ∈ B} y(k)(v, w) )
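The loop above can be sketched as a subgradient method. Here `solve_subproblem` is a stand-in for the combined parse-plus-greedy-bigram argmax (not implemented here); its name, the dictionary encoding of y, and the step-size schedule αk = α/k are illustrative assumptions.

```python
def lagrangian_decode(solve_subproblem, words, K=100, alpha=1.0):
    """Subgradient loop for L(u).

    `solve_subproblem(u)` must return (y_node, y_bigram):
    y_node[v] = 1 if word node v is used in the derivation,
    y_bigram[(v, w)] = 1 if the bigram <v, w> was chosen greedily.
    (This interface is an assumption for illustration.)"""
    u = {v: 0.0 for v in words}                     # u(1)(v) = 0 for all v
    for k in range(1, K + 1):
        y_node, y_bigram = solve_subproblem(u)      # y(k) = argmax_y L(u, y)
        # subgradient for each v: y_v minus the bigrams leaving v
        g = {v: y_node.get(v, 0)
                - sum(c for (v2, _), c in y_bigram.items() if v2 == v)
             for v in words}
        if all(g[v] == 0 for v in words):           # constraint (2) holds
            return y_node, y_bigram                 # certificate: exact
        for v in words:                             # u(k+1) = u(k) - a_k * g
            u[v] -= (alpha / k) * g[v]
    return y_node, y_bigram                         # best effort after K rounds
```

When the check passes, the returned derivation is a provably exact solution; otherwise the loop keeps adjusting the penalties u.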
![Page 255: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/255.jpg)
Thought experiment: Greedy with penalties
Choose best bigram with penalty for a given word
[figure: the word barks with candidate predecessors <s>, dog, cat]
• score(<s>, barks)− u(<s>) + u(barks)
![Page 256: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/256.jpg)
Thought experiment: Greedy with penalties
Choose best bigram with penalty for a given word
• score(<s>, barks)− u(<s>) + u(barks)
• score(cat, barks)− u(cat) + u(barks)
![Page 257: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/257.jpg)
Thought experiment: Greedy with penalties
Choose best bigram with penalty for a given word
• score(<s>, barks)− u(<s>) + u(barks)
• score(cat, barks)− u(cat) + u(barks)
• score(dog, barks)− u(dog) + u(barks)
![Page 258: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/258.jpg)
Thought experiment: Greedy with penalties
Choose best bigram with penalty for a given word
• score(<s>, barks)− u(<s>) + u(barks)
• score(cat, barks)− u(cat) + u(barks)
• score(dog, barks)− u(dog) + u(barks)
Can still compute with a simple maximization:
arg max_{w : 〈w, barks〉 ∈ B} score(w, barks) − u(w) + u(barks)
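The penalized choice is still a flat maximization per word. A small sketch, with made-up scores and penalties for the barks example (the function name and all numbers are illustrative):

```python
def best_predecessor(v, score, u):
    """Pick arg max over w of score(w, v) - u(w) + u(v).
    Note the u(v) term is constant in w, so it never changes the winner,
    only the value contributed to L(u)."""
    candidates = [(w, v2) for (w, v2) in score if v2 == v]
    return max(candidates, key=lambda wv: score[wv] - u[wv[0]] + u[v])[0]

# hypothetical bigram scores and penalties for the word 'barks'
score = {("<s>", "barks"): 1.0, ("dog", "barks"): 2.0, ("cat", "barks"): 1.5}
u = {"<s>": 0.0, "dog": 1.0, "cat": 0.0, "barks": 0.0}
print(best_predecessor("barks", score, u))  # prints "cat": 1.5 beats 1.0 and 1.0
```

With these penalties, the raw best bigram (dog, barks) is penalized down to 1.0, so the choice flips to (cat, barks).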
![Page 259: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/259.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0      0       0    0    0  0    0
Greedy decoding
![Page 260: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/260.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0      0       0    0    0  0    0
Greedy decoding
</s> barks loudly the dog a cat
barks dog barks <s> the <s> a
![Page 261: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/261.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0      0       0    0    0  0    0
Greedy decoding
</s> barks loudly the dog a cat
barks dog barks <s> the <s> a
[derivation: <s> a cat barks loudly </s>, with the greedily chosen bigrams overlaid]
![Page 263: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/263.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0   -1  0    1
Greedy decoding
</s> barks loudly the dog a cat
barks dog barks <s> the <s> a
[derivation: <s> a cat barks loudly </s>, with the greedily chosen bigrams overlaid]
![Page 264: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/264.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0   -1  0    1
Greedy decoding
![Page 265: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/265.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0   -1  0    1
Greedy decoding
</s> barks loudly the dog a cat
loudly cat barks <s> the <s> a
![Page 266: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/266.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0   -1  0    1
Greedy decoding
</s> barks loudly the dog a cat
loudly cat barks <s> the <s> a
[derivation: <s> a cat barks loudly </s>, with the greedily chosen bigrams overlaid]
![Page 267: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/267.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0   -1  0    1
Greedy decoding
</s> barks loudly the dog a cat
loudly cat barks <s> the <s> a
[derivation: <s> the dog barks loudly </s>, with the greedily chosen bigrams overlaid]
![Page 268: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/268.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0  -0.5  0  0.5
Greedy decoding
</s> barks loudly the dog a cat
loudly cat barks <s> the <s> a
[derivation: <s> the dog barks loudly </s>, with the greedily chosen bigrams overlaid]
![Page 269: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/269.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0  -0.5  0  0.5
Greedy decoding
![Page 270: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/270.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0  -0.5  0  0.5
Greedy decoding
</s> barks loudly the dog a cat
loudly dog barks <s> the <s> a
![Page 271: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/271.jpg)
Algorithm example
Penalties

v      </s>  barks  loudly  the  dog  a  cat
u(v)      0     -1       1    0  -0.5  0  0.5
Greedy decoding
</s> barks loudly the dog a cat
loudly dog barks <s> the <s> a
[derivation: <s> the dog barks loudly </s>, with the greedily chosen bigrams overlaid]
![Page 272: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/272.jpg)
Constraint Issue
Constraints do not capture all possible reorderings

Example: Add the rule 〈5 → cat a〉 to the forest. New derivation:
[derivation using 5 → cat a: <s> cat a barks loudly </s>, with the chosen bigrams overlaid]
Satisfies both constraints (1) and (2), but is not self-consistent.
![Page 274: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/274.jpg)
New Constraints: Paths

[derivation: <s> a cat barks loudly </s>, annotated with the path markers < a ↓>, < 5 ↓, a ↓>, < 4 ↓, 5 ↓>, < <s> ↑, 4 ↓>, < <s> ↑>]
Fix: In addition to bigrams, consider paths between terminal nodes

Example: Path marker 〈5 ↓, 10 ↓〉 implies that between two word nodes, we move down from node 5 to node 10
![Page 279: Dual Decomposition for Natural Language Processingpeople.seas.harvard.edu/~srush/dual_decomp_tutorial.pdf · 2018-07-02 · Dual Decomposition for Natural Language Processing](https://reader034.vdocument.in/reader034/viewer/2022042302/5ecd3fadc597974194584f9b/html5/thumbnails/279.jpg)
Greedy Language Model with Paths
Step 1. Greedily choose the best path for each word

</s>:    < </s> ↓>, < 4 ↑, </s> ↓>, < loudly ↑, 4 ↓>, < loudly ↑>
barks:   < barks ↓>, < 5 ↑, barks ↓>, < cat ↑, 5 ↑>, < cat ↑>
loudly:  < loudly ↓>, < loudly ↓, barks ↑>, < barks ↑>
the:     < the ↓>, < 5 ↓, the ↓>, < 4 ↓, 5 ↓>, < <s> ↑, 4 ↓>, < <s> ↑>
dog:     < dog ↓>, < the ↑, dog ↓>, < the ↑>
a:       < a ↓>, < 5 ↓, a ↓>, < 4 ↓, 5 ↓>, < <s> ↑, 4 ↓>, < <s> ↑>
cat:     < cat ↓>, < a ↑, cat ↓>, < a ↑>
Greedy Language Model with Paths (continued)
Step 2. Find the best derivation over these elements

[Figure: derivation tree over nodes 1, 4, 5 spanning "<s> a cat barks loudly </s>"]
< </s> ↓>
< 4 ↑, </s> ↓>
< loudly ↑, 4 ↓>
< loudly ↑>
< barks ↓>
< 5 ↑,barks ↓>
< cat ↑, 5 ↑>
< cat ↑>
< loudly ↓>
< loudly ↓, barks ↑>
< barks ↑>
< a ↓>
< 5 ↓, a ↓>
< 4 ↓, 5 ↓>
< <s> ↑, 4 ↓>
< <s> ↑>
< cat ↓>
< a ↑, cat ↓>
< a ↑>
Efficiently Calculating Best Paths
There are too many paths to compute the argmax directly, but we can compactly represent all paths as a graph

[Figure: path graph with nodes such as < 3 ↑, 1 ↑>, < 5 ↓, 10 ↓>, < 5 ↑, 6 ↓>, < 4 ↑, 3 ↓>, ...]

The graph is linear in the size of the grammar
• Green nodes represent leaving a word
• Red nodes represent entering a word
• Black nodes are intermediate paths
Best Paths

[Figure: the same path graph, with word nodes highlighted]

Goal: Find the best path between all word nodes (green and red)

Method: Run all-pairs shortest path to find best paths
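The all-pairs shortest-path step above can be sketched with a plain Floyd–Warshall pass; the graph, node names, and weights below are toy illustrations, not the tutorial's actual grammar graph.

```python
# Minimal all-pairs shortest-path sketch (Floyd-Warshall, O(V^3)), as would be
# used to precompute the best path between every pair of word nodes.
# The example graph below is hypothetical.
INF = float("inf")

def all_pairs_shortest_paths(nodes, edges):
    """edges: dict mapping (u, v) -> edge weight."""
    dist = {(u, v): (0.0 if u == v else INF) for u in nodes for v in nodes}
    for (u, v), w in edges.items():
        dist[(u, v)] = min(dist[(u, v)], w)
    for k in nodes:                      # allow paths through intermediate node k
        for i in nodes:
            for j in nodes:
                if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
                    dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
    return dist

# Toy graph: word nodes "a" and "cat" connected through intermediate node "5"
nodes = ["a", "5", "cat"]
edges = {("a", "5"): 1.0, ("5", "cat"): 2.0, ("a", "cat"): 4.0}
dist = all_pairs_shortest_paths(nodes, edges)
print(dist[("a", "cat")])  # 3.0, via node 5, beating the direct edge of 4.0
```

Since the graph is linear in the grammar size, this precomputation stays cheap per sentence.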
Full Algorithm

The algorithm is very similar to the simple bigram case. Penalty weights are associated with nodes in the graph instead of just bigram words.

Theorem: If at any iteration the greedy paths agree with the derivation, then y^(k) is the global optimum.

But what if it does not find the global optimum?
Convergence
The algorithm is not guaranteed to converge. It may get stuck alternating between solutions.

[Figure: successive iterations oscillate between the derivation "<s> a cat barks loudly </s>" with best paths spelling "<s> a dog barks loudly", and the derivation "<s> the dog barks loudly </s>" with best paths spelling "<s> the cat barks loudly"]
We can fix this by incrementally adding constraints to the problem
Tightening
Main idea: keep partition sets (A and B). The parser treats all words in a partition as the same word.
• Initially place all words in the same partition.
• If the algorithm gets stuck, separate words that conflict.
• Run the exact algorithm, but only distinguish between partitions (much faster than running the full exact algorithm).

Example:

[Figure: derivation "<s> a cat barks loudly </s>" with best paths spelling "<s> a dog barks loudly"]

Partitions: A = {2, 6, 7, 8, 9, 10, 11}, B = {}
After separating the conflicting word:

[Figure: derivation "<s> the dog barks loudly </s>" with graph nodes labeled by their partition (A or B), and best paths now spelling "<s> the dog barks loudly"]

Partitions: A = {2, 6, 7, 8, 9, 10}, B = {11}
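The partition idea above can be sketched in a few lines; the helper name and node ids are illustrative, not from the tutorial. The restricted exact algorithm sees only the partition label of each word, so all words in the same partition remain indistinguishable to the parser.

```python
# Sketch of the tightening projection (hypothetical helper): map each word
# node to its partition label before running the restricted exact algorithm.
def project(word_id, A, B):
    """Collapse a word node to its partition label."""
    return "A" if word_id in A else "B"

# Partitions after separating conflicting node 11, as in the example above
A = {2, 6, 7, 8, 9, 10}
B = {11}

sentence = [2, 11, 7]  # toy sequence of word node ids
print([project(w, A, B) for w in sentence])  # ['A', 'B', 'A']
```

Only the few separated words are distinguished, which keeps the restricted problem far smaller than the full exact one.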
Experiments
Properties:
• Exactness
• Translation Speed
• Comparison to Cube Pruning
Model:
• Tree-to-String translation model (Huang and Mi, 2010)
• Trained with MERT
Experiments:
• NIST MT Evaluation Set (2008)
Exactness

[Bar chart: percent of sentences decoded exactly (y-axis 50-100%) for LR, ILP, DP, LP]

LR: Lagrangian Relaxation; ILP: Integer Linear Programming; DP: Exact Dynamic Programming; LP: Linear Programming
Median Speed

[Bar chart: sentences per second (y-axis 0-1.4) for LR, ILP, DP, LP]

LR: Lagrangian Relaxation; ILP: Integer Linear Programming; DP: Exact Dynamic Programming; LP: Linear Programming
Comparison to Cube Pruning: Exactness

[Bar chart: percent of sentences decoded exactly (y-axis 40-100%) for LR, Cube(50), Cube(500)]

LR: Lagrangian Relaxation; Cube(50): Cube Pruning (Beam=50); Cube(500): Cube Pruning (Beam=500)
Comparison to Cube Pruning: Median Speed

[Bar chart: sentences per second (y-axis 0-20) for LR, Cube(50), Cube(500)]

LR: Lagrangian Relaxation; Cube(50): Cube Pruning (Beam=50); Cube(500): Cube Pruning (Beam=500)
The Phrase-Based Decoding Problem
• We have a source-language sentence x_1, x_2, ..., x_N (x_i is the i'th word in the sentence)
• A phrase p is a tuple (s, t, e) signifying that words x_s ... x_t have a target-language translation as e
• E.g., p = (2, 5, the dog) specifies that words x_2 ... x_5 have a translation as "the dog"
• Output from a phrase-based model is a derivation

  y = p_1 p_2 ... p_L

  where p_j for j = 1 ... L are phrases. A derivation defines a translation e(y) formed by concatenating the strings e(p_1) e(p_2) ... e(p_L)
Scoring Derivations

• Each phrase p has a score g(p).
• For two consecutive phrases pk = (s, t, e) and pk+1 = (s′, t′, e′), the distortion distance is δ(t, s′) = |t + 1 − s′|
• The score for a derivation is

  f(y) = h(e(y)) + Σ_{k=1..L} g(pk) + Σ_{k=1..L−1} η × δ(t(pk), s(pk+1))

  where η ∈ R is the distortion penalty, and h(e(y)) is the language model score
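The scoring function above can be sketched directly. In this sketch, `g`, `h`, and `eta` are hypothetical stand-ins for the phrase scores, the language model, and the distortion penalty; none of them are specified on the slides.

```python
def distortion(t, s_next):
    # delta(t, s') = |t + 1 - s'|; zero when the next phrase starts
    # immediately after the current one ends
    return abs(t + 1 - s_next)

def derivation_score(phrases, g, h, eta):
    """phrases: list of (s, t, e) tuples; g: phrase score function;
    h: language model score function; eta: distortion penalty."""
    translation = " ".join(e for (_, _, e) in phrases)       # e(y)
    score = h(translation) + sum(g(p) for p in phrases)
    # add the distortion term for each consecutive phrase pair
    for (_, t, _), (s2, _, _) in zip(phrases, phrases[1:]):
        score += eta * distortion(t, s2)
    return score
```

With η = −1 and trivial g and h, a monotone derivation incurs no distortion cost, while a reordered one is penalized by the jump distance.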
The Decoding Problem

• Y is the set of all valid derivations
• For a derivation y, y(i) is the number of times word i is translated
• A derivation y = p1, p2, ..., pL is valid if:
  • y(i) = 1 for i = 1 ... N
  • For each pair of consecutive phrases pk, pk+1 for k = 1 ... L − 1, we have δ(t(pk), s(pk+1)) ≤ d, where d is the distortion limit
• The decoding problem is to find

  argmax_{y ∈ Y} f(y)
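The two validity conditions above can be checked mechanically; this is a sketch with phrases as (s, t, e) tuples (the helper name is ours, not the slides').

```python
def is_valid(phrases, N, d):
    """True iff the derivation translates each word exactly once and
    respects the distortion limit d between consecutive phrases."""
    counts = [0] * (N + 1)            # counts[i] = y(i), 1-indexed
    for (s, t, _) in phrases:
        for i in range(s, t + 1):
            counts[i] += 1
    # condition 1: y(i) = 1 for every source word i
    if any(counts[i] != 1 for i in range(1, N + 1)):
        return False
    # condition 2: delta(t(pk), s(pk+1)) <= d for consecutive phrases
    return all(abs(t + 1 - s2) <= d
               for (_, t, _), (s2, _, _) in zip(phrases, phrases[1:]))
```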
Exact Dynamic Programming

• We can find argmax_{y ∈ Y} f(y) using dynamic programming
• But the runtime (and number of states) is exponential in N
• Dynamic programming states are of the form (w1, w2, b, r), where
  • w1, w2 are the last two words of a hypothesis
  • b is a bit-string of length N, recording which words have been translated (2^N possibilities)
  • r is the end-point of the last phrase in the hypothesis
A Lagrangian Relaxation Algorithm

• Define Y′ to be the set of derivations such that:
  • Σ_{i=1..N} y(i) = N
  • For each pair of consecutive phrases pk, pk+1 for k = 1 ... L − 1, we have δ(t(pk), s(pk+1)) ≤ d, where d is the distortion limit
• Notes:
  • We have dropped the y(i) = 1 constraints
  • We have Y ⊂ Y′
Dynamic Programming over Y′

• We can find argmax_{y ∈ Y′} f(y) efficiently, using dynamic programming
• Dynamic programming states are of the form (w1, w2, n, r), where
  • w1, w2 are the last two words of a hypothesis
  • n is the length of the partial hypothesis
  • r is the end-point of the last phrase in the hypothesis
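To see why this relaxation matters, compare the state spaces of the two dynamic programs: the exact program tracks a bit-string over the N source words, while the relaxed program tracks only the hypothesis length n. This illustrative sketch ignores the (w1, w2, r) factors, which are identical in both programs.

```python
def exact_bitstring_states(N):
    # the exact DP has one state per bit-string over N words
    return 2 ** N

def relaxed_length_states(N):
    # the relaxed DP has one state per hypothesis length n = 0 ... N
    return N + 1
```

Already at N = 20 the bit-string factor is 2^20 = 1,048,576 states versus 21 length values, which is why decoding over Y′ is tractable while exact decoding is not.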
A Lagrangian Relaxation Algorithm (continued)

• The original decoding problem is

  argmax_{y ∈ Y} f(y)

• We can rewrite this as

  argmax_{y ∈ Y′} f(y) such that ∀i, y(i) = 1

• We deal with the y(i) = 1 constraints using Lagrangian relaxation
A Lagrangian Relaxation Algorithm (continued)

The Lagrangian is

  L(u, y) = f(y) + Σ_i u(i)(y(i) − 1)

The dual objective is then

  L(u) = max_{y ∈ Y′} L(u, y)

and the dual problem is to solve

  min_u L(u)
The Algorithm

Initialization: u0(i) ← 0 for i = 1 ... N

for t = 1 ... T:
    yt = argmax_{y ∈ Y′} L(u_{t−1}, y)
    if yt(i) = 1 for i = 1 ... N:
        return yt
    else:
        for i = 1 ... N:
            ut(i) = u_{t−1}(i) − αt (yt(i) − 1)

Figure: The decoding algorithm. αt > 0 is the step size at the t'th iteration.
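The loop above can be sketched in Python. Here `solve_relaxed` is a hypothetical stand-in for the dynamic program over Y′ (it is not part of the slides); it takes the current multipliers and returns a derivation as a list of (s, t, e) phrases.

```python
def lagrangian_decode(solve_relaxed, N, T=250, step=lambda t: 1.0 / t):
    """Subgradient loop: returns a derivation with y(i) = 1 for all i
    (a certificate of optimality), or None after T iterations."""
    u = [0.0] * (N + 1)                    # u[1..N]; u[0] unused
    for t in range(1, T + 1):
        y = solve_relaxed(u)               # argmax over Y' of L(u, y)
        counts = [0] * (N + 1)             # counts[i] = y_t(i)
        for (s, end, _) in y:
            for i in range(s, end + 1):
                counts[i] += 1
        if all(counts[i] == 1 for i in range(1, N + 1)):
            return y                       # all constraints satisfied
        for i in range(1, N + 1):          # subgradient step on u
            u[i] -= step(t) * (counts[i] - 1)
    return None                            # no certificate within T iterations
```

With a decaying step size, over-translated words are penalized and untranslated words rewarded until the relaxed solution satisfies every y(i) = 1 constraint.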
An Example Run of the Algorithm
Input German: dadurch können die qualität und die regelmäßige postzustellung auch weiterhin sichergestellt werden .

[Figure: trace of seven iterations t = 1 ... 7, showing for each t the dual value L(u_{t−1}), the counts yt(i), and the derivation yt]

Figure 2: An example run of the algorithm in Figure 1. For each value of t we show the dual value L(u_{t−1}), the derivation yt, and the number of times each word is translated, yt(i) for i = 1 ... N. For each phrase in a derivation we show the English string e, together with the span (s, t): for example, the first phrase in the first derivation has English string "the quality and", and span (3, 6). At iteration 7 we have yt(i) = 1 for i = 1 ... N, and the translation is returned, with a guarantee that it is optimal.
The dual objective is then

  L(u) = max_{y ∈ Y′} L(u, y)

and the dual problem is to solve

  min_u L(u)

The next section gives a number of formal results describing how solving the dual problem will be useful in solving the original optimization problem.

We now describe an algorithm that solves the dual problem. By standard results for Lagrangian relaxation (Korte and Vygen, 2008), L(u) is a convex function; it can be minimized by a subgradient method. If we define

  yu = argmax_{y ∈ Y′} L(u, y)

and γu(i) = yu(i) − 1 for i = 1 ... N, then γu is a subgradient of L(u) at u. A subgradient method is an iterative method for minimizing L(u), which performs updates ut ← u_{t−1} − αt γ_{u_{t−1}}, where αt > 0 is the step size for the t'th subgradient step.

Figure 1 depicts the resulting algorithm. At each iteration, we solve

  argmax_{y ∈ Y′} [ f(y) + Σ_i u(i)(y(i) − 1) ] = argmax_{y ∈ Y′} [ f(y) + Σ_i u(i) y(i) ]

by the dynamic program described in the previous section. Incorporating the Σ_i u(i) y(i) terms in the dynamic program is straightforward: we simply redefine the phrase scores as

  g′(s, t, e) = g(s, t, e) + Σ_{i=s..t} u(i)

Intuitively, each Lagrange multiplier u(i) penalizes or rewards phrases that translate word i; the algorithm attempts to adjust the Lagrange multipliers in such a way that each word is translated exactly once. The updates ut(i) = u_{t−1}(i) − αt (yt(i) − 1) will decrease the value of u(i) if yt(i) > 1, increase the value of u(i) if yt(i) = 0, and leave u(i) unchanged if yt(i) = 1.
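The redefined phrase score can be sketched in a couple of lines (the function name is ours):

```python
def modified_phrase_score(g, u, s, t, e):
    """g'(s, t, e) = g(s, t, e) + sum_{i=s}^{t} u(i).
    u is a 1-indexed sequence of Lagrange multipliers (u[0] unused)."""
    return g(s, t, e) + sum(u[i] for i in range(s, t + 1))
```

Because the multipliers decompose over source positions, the relaxed dynamic program stays unchanged except for this per-phrase score adjustment.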
4.3 Properties

We now give some theorems stating formal properties of the Lagrangian relaxation algorithm. The proofs are simple, and are given in the supplemental material for this submission. First, define y* to be the optimal solution for our original problem:

Definition 1. y* = argmax_{y ∈ Y} f(y)

Our first theorem states that the dual function provides an upper bound on the score for the optimal translation, f(y*):

Theorem 1. For any value of u ∈ R^N, L(u) ≥ f(y*).

The second theorem states that under an appropriate choice of the step sizes αt, the method converges
Tightening the Relaxation

• In some cases, the relaxation is not tight, and the algorithm will not converge to y(i) = 1 for i = 1 ... N
• Our solution: incrementally add hard constraints until the relaxation is tight
• Definition: for any set C ⊆ {1, 2, ..., N},

  Y′_C = {y : y ∈ Y′, and ∀i ∈ C, y(i) = 1}

• We can find argmax_{y ∈ Y′_C} f(y) using dynamic programming, with a 2^{|C|} increase in the number of states
• Goal: find a small set C such that Lagrangian relaxation with Y′_C returns an exact solution
An Example Run of the Algorithm
Input German: es bleibt jedoch dabei , dass kolumbien ein land ist , das aufmerksam beobachtet werden muss .
[Figure: trace showing, for iterations t = 1, 2, ..., 42, the dual value L(u_{t−1}), the counts yt(i), and the derivation yt; from iteration 32 to 41 the solutions oscillate between violations at words 6 and 10 (count(6) = 10, count(10) = 10, count(i) = 0 for all other i), hard constraints are added for words 6 and 10, and iteration 42 returns a solution with yt(i) = 1 for all i]

Figure 1: An example run of the algorithm in Figure 3. At iteration 32, we start the K = 10 iterations to count which constraints are violated most often. After K iterations, the count for words 6 and 10 is 10, and all other constraints have not been violated during the K iterations. Thus, hard constraints for words 6 and 10 are added. After adding the constraints, we have yt(i) = 1 for i = 1 ... N, and the translation is returned, with a guarantee that it is optimal.
Calculating the scores for the unconstrained dynamic program is much cheaper than calculating the scores for the constrained dynamic program directly. Guided by the unconstrained scores, the constrained scores can be calculated efficiently by using A* search.

Using the A* algorithm leads to significant improvements in efficiency when constraints are added. Section 6 presents a comparison of the running time with and without the A* algorithm.

D Step Size

Following prior work, we set the step size at the t'th iteration to be αt = 1/(1 + ηt), where ηt is the number of iterations t′ ≤ t at which the dual value increased, i.e., L(u^{(t′)}) > L(u^{(t′−1)}).
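The step-size rule can be sketched by replaying a sequence of dual values (the function name is ours):

```python
def step_sizes(dual_values):
    """alpha_t = 1 / (1 + eta_t), where eta_t counts the iterations
    t' <= t at which the dual value increased over the previous one."""
    alphas, increases, prev = [], 0, None
    for value in dual_values:
        if prev is not None and value > prev:
            increases += 1                 # dual went up: halve-style decay
        alphas.append(1.0 / (1 + increases))
        prev = value
    return alphas
```

The effect is that the step size only shrinks when the dual value moves in the wrong direction, which in practice keeps progress fast while still guaranteeing decay.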
The Algorithm with Constraint Generation

Optimize(C, u):
    while (dual value still improving):
        y* = argmax_{y ∈ Y′_C} L(u, y)
        if y*(i) = 1 for i = 1 ... N: return y*
        else:
            for i = 1 ... N:
                u(i) = u(i) − α (y*(i) − 1)

    count(i) = 0 for i = 1 ... N
    for k = 1 ... K:
        y* = argmax_{y ∈ Y′_C} L(u, y)
        if y*(i) = 1 for i = 1 ... N: return y*
        else:
            for i = 1 ... N:
                u(i) = u(i) − α (y*(i) − 1)
                count(i) = count(i) + [[y*(i) ≠ 1]]

    Let C′ = the set of G indices i that have the largest value for count(i) and that are not in C
    return Optimize(C ∪ C′, u)
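The constraint-generation phase of this procedure can be sketched as follows. Here `solve_relaxed(C, u)` is a hypothetical stand-in for the dynamic program over Y′_C, returning the counts y*(i) as a dict, and the initial "dual still improving" phase is elided (each round simply runs K subgradient steps before adding constraints).

```python
def optimize(solve_relaxed, N, C=frozenset(), K=10, G=3, alpha=1.0,
             max_rounds=10):
    """Recursing is unrolled into a loop: each round runs K subgradient
    steps over Y'_C, then adds up to G of the most-violated constraints."""
    u = {i: 0.0 for i in range(1, N + 1)}
    for _ in range(max_rounds):
        count = {i: 0 for i in range(1, N + 1)}
        for _ in range(K):
            y = solve_relaxed(C, u)                  # y[i] = y*(i)
            if all(y[i] == 1 for i in range(1, N + 1)):
                return y                             # exact solution found
            for i in range(1, N + 1):
                u[i] -= alpha * (y[i] - 1)
                count[i] += int(y[i] != 1)           # track violations
        violated = [i for i in count if count[i] > 0 and i not in C]
        new = sorted(violated, key=lambda i: -count[i])[:G]
        C = C | frozenset(new)                       # tighten and repeat
    return None
```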
Number of Constraints Required
| # cons. | 1-10 words | 11-20 words | 21-30 words | 31-40 words | 41-50 words | All sentences | cumulative |
|---|---|---|---|---|---|---|---|
| 0 | 183 (98.9%) | 511 (91.6%) | 438 (77.4%) | 222 (64.0%) | 82 (48.8%) | 1,436 (78.7%) | 78.7% |
| 1-3 | 2 (1.1%) | 45 (8.1%) | 94 (16.6%) | 87 (25.1%) | 50 (29.8%) | 278 (15.2%) | 94.0% |
| 4-6 | 0 (0.0%) | 2 (0.4%) | 27 (4.8%) | 24 (6.9%) | 19 (11.3%) | 72 (3.9%) | 97.9% |
| 7-9 | 0 (0.0%) | 0 (0.0%) | 7 (1.2%) | 13 (3.7%) | 12 (7.1%) | 32 (1.8%) | 99.7% |
| x | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 1 (0.3%) | 5 (3.0%) | 6 (0.3%) | 100.0% |

Table 2: The number of constraints added before convergence of the algorithm in Figure 3, broken down by sentence length. Note that a maximum of 3 constraints are added at each recursive call, but fewer than 3 constraints are added in cases where fewer than 3 constraints have count(i) > 0. x indicates the sentences that fail to converge after 250 iterations. 78.7% of the examples converge without adding any constraints.

| # cons. | 1-10 A* | 1-10 w/o | 11-20 A* | 11-20 w/o | 21-30 A* | 21-30 w/o | 31-40 A* | 31-40 w/o | 41-50 A* | 41-50 w/o | All A* | All w/o |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.8 | 0.8 | 9.7 | 10.7 | 47.0 | 53.7 | 153.6 | 178.6 | 402.6 | 492.4 | 64.6 | 76.1 |
| 1-3 | 2.4 | 2.9 | 23.2 | 28.0 | 80.9 | 102.3 | 277.4 | 360.8 | 686.0 | 877.7 | 241.3 | 309.7 |
| 4-6 | 0.0 | 0.0 | 28.2 | 38.8 | 111.7 | 163.7 | 309.5 | 575.2 | 1,552.8 | 1,709.2 | 555.6 | 699.5 |
| 7-9 | 0.0 | 0.0 | 0.0 | 0.0 | 166.1 | 500.4 | 361.0 | 1,467.6 | 1,167.2 | 3,222.4 | 620.7 | 1,914.1 |
| mean | 0.8 | 0.9 | 10.9 | 12.3 | 57.2 | 72.6 | 203.4 | 299.2 | 679.9 | 953.4 | 120.9 | 168.9 |
| median | 0.7 | 0.7 | 8.9 | 9.9 | 48.3 | 54.6 | 169.7 | 202.6 | 484.0 | 606.5 | 35.2 | 40.0 |

Table 3: The average time (in seconds) for decoding using the algorithm in Figure 3, with and without the A* algorithm, broken down by sentence length and by the number of constraints that are added. "A*" indicates decoding sped up with A* search; "w/o" denotes decoding without A*.
a general-purpose integer linear programming (ILP) solver, which solves the problem exactly.

The experiments focus on translation from German to English, using the Europarl data (Koehn, 2005). We tested on 1,824 sentences of length at most 50 words. The experiments use the algorithm shown in Figure 3. We limit the algorithm to a maximum of 250 iterations and a maximum of 9 hard constraints. The distortion limit d is set to four, and we prune the phrase translation table to 10 English phrases per German phrase.

Our method finds exact solutions on 1,818 out of 1,824 sentences (99.67%). (6 examples do not converge within 250 iterations.) Table 1 shows the number of iterations required for convergence, and Table 2 shows the number of constraints required for convergence, broken down by sentence length. In 1,436/1,818 (78.7%) sentences, the method converges without adding hard constraints to tighten the relaxation. For sentences with 1-10 words, the vast majority (183 out of 185 examples) converge with 0 constraints added. As sentences get longer, more constraints are often required. However, most examples converge with 9 or fewer constraints.

Table 3 shows the average times for decoding, broken down by sentence length and by the number of constraints that are added. As expected, decoding times increase as the length of sentences, and the number of constraints required, increase. The average run time across all sentences is 120.9 seconds. Table 3 also shows the run time of the method without the A* algorithm for decoding. The A* algorithm gives significant reductions in runtime.

6.1 Comparison to an LP/ILP solver

To compare to a linear programming (LP) or integer linear programming (ILP) solver, we can implement the dynamic programming part of our algorithm (search over the set Y′) through linear constraints, with a linear objective. The y(i) = 1 constraints are also linear. Hence we can encode our relaxation within an LP or ILP. Having done this, we tested the resulting LP or ILP using Gurobi, a high-performance commercial-grade solver. We also compare to an LP or ILP where the dynamic program makes use of states (w1, w2, n, r), i.e., the span (l, m) is dropped, making the dynamic program smaller. Table 4 includes results on the time required to find a solution on sentences of length 1-15. Both the LP and the ILP require very long running times on these shorter sentences, and running times on longer sentences are prohibitive. Our algorithm is more efficient because it leverages the structure of the problem, by directly using a combinatorial algorithm (dynamic programming).
Time Required
Comparison to LP/ILP Decoding
| method | length | ILP mean | ILP median | LP mean | LP median | % frac. |
|---|---|---|---|---|---|---|
| Y′′ | 1-10 | 275.2 | 132.9 | 10.9 | 4.4 | 12.4% |
| Y′′ | 11-15 | 2,707.8 | 1,138.5 | 177.4 | 66.1 | 40.8% |
| Y′′ | 16-20 | 20,583.1 | 3,692.6 | 1,374.6 | 637.0 | 59.7% |
| Y′ | 1-10 | 257.2 | 157.7 | 18.4 | 8.9 | 1.1% |
| Y′ | 11-15 | N/A | N/A | 476.8 | 161.1 | 3.0% |

Table 4: Average and median time of the LP/ILP solver (in seconds). "% frac." indicates how often the LP gives a fractional answer. Y′ indicates the dynamic program using set Y′ as defined in Section 4.1, and Y′′ indicates the dynamic program using states (w1, w2, n, r). The statistics for the ILP for length 16-20 are based on 50 sentences.
6.2 Comparison to MOSES

We now describe comparisons to the phrase-based decoder implemented in MOSES. MOSES uses beam search to find approximate solutions.

The distortion limit described in Section 3 is the same as that in Koehn et al. (2003), and is the same as that described in the user manual for MOSES (Koehn et al., 2007). However, a complicating factor for our comparisons is that MOSES uses an additional distortion constraint, not documented in the manual, which we describe here.[4] We call this constraint the gap constraint. We will show in experiments that without the gap constraint, MOSES fails to produce translations on many examples. In our experiments we compare to MOSES both with and without the gap constraint (in the latter case, we discard examples where MOSES fails).

We now describe the gap constraint. For a sequence of phrases p1, ..., pk, define γ(p1 ... pk) to be the index of the left-most source-language word not translated in this sequence. For example, if the bit-string for p1 ... pk is 111001101000, then γ(p1 ... pk) = 4. A sequence of phrases p1 ... pL satisfies the gap constraint if and only if, for k = 2 ... L, |t(pk) + 1 − γ(p1 ... pk)| ≤ d, where d is the distortion limit. We call MOSES without this restriction MOSES-nogc, and MOSES with this restriction MOSES-gc.
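The gap constraint can be sketched as a check over a phrase sequence; the helper mirrors the bit-string example in the text, and the function names are ours.

```python
def leftmost_untranslated(bits):
    # 1-based index of the left-most source word not yet translated
    return bits.index("0") + 1

def satisfies_gap_constraint(phrases, N, d):
    """True iff |t(p_k) + 1 - gamma(p_1 ... p_k)| <= d for k = 2 ... L,
    where gamma is the left-most untranslated source position."""
    translated = [False] * (N + 1)
    for k, (s, t, _) in enumerate(phrases):
        for i in range(s, t + 1):
            translated[i] = True
        if k >= 1:  # constraint applies from the second phrase on
            gap = next((i for i in range(1, N + 1) if not translated[i]),
                       N + 1)
            if abs(t + 1 - gap) > d:
                return False
    return True
```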
Results for MOSES-nogc. Table 5 shows the number of examples where MOSES-nogc fails to give a translation, and the number of search errors for those cases where it does give a translation, for a range of beam sizes. A search error is defined as a case where our algorithm produces an exact solution that has a higher score than the output from MOSES-nogc. The number of search errors is significant, even for large beam sizes.

[4] Personal communication from Philipp Koehn; see also the software for MOSES.

| Beam size | Fails | # search errors | Percentage |
|---|---|---|---|
| 100 | 650/1,818 | 214/1,168 | 18.32% |
| 200 | 531/1,818 | 207/1,287 | 16.08% |
| 1,000 | 342/1,818 | 115/1,476 | 7.79% |
| 10,000 | 169/1,818 | 68/1,649 | 4.12% |

Table 5: The number of examples where MOSES-nogc fails to give a translation, and the number/percentage of search errors for cases where it does give a translation.

Results for MOSES-gc. MOSES-gc uses the gap constraint, and thus in some cases our decoder will produce derivations which MOSES-gc cannot reach. Among the 1,818 sentences where we produce a solution, there are 270 such derivations. For the remaining 1,548 sentences, MOSES-gc makes search errors on 2 sentences (0.13%) when the beam size is 100, and no search errors when the beam size is 200, 1,000, or 10,000.

Finally, Table 6 shows statistics for the magnitude of the search errors that MOSES-gc and MOSES-nogc make.

| Diff. | MOSES-gc (s=100) | MOSES-gc (s=200) | MOSES-nogc (s=1,000) |
|---|---|---|---|
| 0.000-0.125 | 66 (24.26%) | 65 (24.07%) | 32 (27.83%) |
| 0.125-0.250 | 59 (21.69%) | 58 (21.48%) | 25 (21.74%) |
| 0.250-0.500 | 65 (23.90%) | 65 (24.07%) | 25 (21.74%) |
| 0.500-1.000 | 49 (18.01%) | 49 (18.15%) | 23 (20.00%) |
| 1.000-2.000 | 31 (11.40%) | 31 (11.48%) | 5 (4.35%) |
| 2.000-4.000 | 2 (0.74%) | 2 (0.74%) | 3 (2.61%) |
| 4.000-13.000 | 0 (0.00%) | 0 (0.00%) | 2 (1.74%) |

Table 6: Statistics for the difference between the translation score from MOSES and from the optimal derivation, for those sentences where a search error is made. For MOSES-gc we include cases where the translation produced by our system is not reachable by MOSES-gc. The average score of the optimal derivations is -62.43.
7 ConclusionsWe have described an exact decoding algorithm forphrase-based translation models, using Lagrangianrelaxation. The algorithmic construction we havedescribed may also be useful in other areas of NLP,for example natural language generation. Possi-ble extensions to the approach include methods thatincorporate the Lagrangian relaxation formulationwithin learning algorithms for statistical MT: we seethis as an interesting avenue for future research.
Number of Iterations Required
| # iter. | 1-10 words | 11-20 words | 21-30 words | 31-40 words | 41-50 words | All sentences | cumulative |
|---|---|---|---|---|---|---|---|
| 0-7 | 166 (89.7%) | 219 (39.2%) | 34 (6.0%) | 2 (0.6%) | 0 (0.0%) | 421 (23.1%) | 23.1% |
| 8-15 | 17 (9.2%) | 187 (33.5%) | 161 (28.4%) | 30 (8.6%) | 3 (1.8%) | 398 (21.8%) | 44.9% |
| 16-30 | 1 (0.5%) | 93 (16.7%) | 208 (36.7%) | 112 (32.3%) | 22 (13.1%) | 436 (23.9%) | 68.8% |
| 31-60 | 1 (0.5%) | 52 (9.3%) | 105 (18.6%) | 99 (28.5%) | 62 (36.9%) | 319 (17.5%) | 86.3% |
| 61-120 | 0 (0.0%) | 7 (1.3%) | 54 (9.5%) | 89 (25.6%) | 45 (26.8%) | 195 (10.7%) | 97.0% |
| 121-250 | 0 (0.0%) | 0 (0.0%) | 4 (0.7%) | 14 (4.0%) | 31 (18.5%) | 49 (2.7%) | 99.7% |
| x | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 1 (0.3%) | 5 (3.0%) | 6 (0.3%) | 100.0% |

Table 1: The number of iterations taken for the algorithm to converge, broken down by sentence length. x indicates sentences that fail to converge after 250 iterations. 97% of the examples converge within 120 iterations.
have or haven’t been translated in a hypothesis (par-tial derivation). Note that if C = {1 . . . N}, we haveY !C = Y , and the dynamic program will correspond
to exhaustive dynamic programming.
We can again run a Lagrangian relaxation algo-rithm, using the set Y !
C in place of Y !. We will useLagrange multipliers u(i) to enforce the constraintsy(i) = 1 for i /! C. Our goal will be to find asmall set of constraints C, such that Lagrangian re-laxation will successfully recover an optimal solu-tion. We will do this by incrementally adding el-ements to C; that is, by incrementally adding con-straints that tighten the relaxation.
The intuition behind our approach is as follows.Say we run the original algorithm, with the set Y !,for several iterations, so that L(u) is close to con-vergence (i.e., L(u) is close to its minimal value).However, assume that we have not yet generated asolution yt such that yt(i) = 1 for all i. In this casewe have some evidence that the relaxation may notbe tight, and that we need to add some constraints.The question is, which constraints to add? To an-swer this question, we run the subgradient algorithmfor K more iterations (e.g., K = 10), and at each it-eration track which constraints of the form y(i) = 1are violated. We then choose C to be the G con-straints (e.g., G = 3) that are violated most oftenduring the K additional iterations, and are not ad-jacent to each other. We recursively call the algo-rithm, replacing Y ! by Y !
C ; the recursive call maythen return an exact solution, or alternatively againadd more constraints and make a recursive call.
Figure 3 depicts the resulting algorithm. We ini-tially make a call to the algorithm Optimize(C, u)with C equal to the empty set (i.e., no hard con-straints), and with u(i) = 0 for all i. In an initialphase the algorithm runs subgradient steps, while
the dual is still improving. In a second step, if a so-lution has not been found, the algorithm runs for Kmore iterations, thereby choosing G additional con-straints, then recursing.
If at any stage the algorithm finds a solution y∗ such that y∗(i) = 1 for all i, then this is the solution to our original problem, argmax_{y ∈ Y} f(y). This follows because for any C ⊆ {1 . . . N} we have Y ⊆ Y′_C; hence the theorems in section 4.3 go through for Y′_C in place of Y′, with trivial modifications. Note also that the algorithm is guaranteed to eventually find the optimal solution, because eventually C = {1 . . . N}, and Y = Y′_C.
The remaining question concerns the “dual still improving” condition, i.e., how to determine that the first phase of the algorithm should terminate. We do this by recording the first and second best dual values L(u′) and L(u″) in the sequence of Lagrange multipliers u^1, u^2, . . . generated by the algorithm. Suppose that L(u″) first occurs at iteration t″. If (L(u′) − L(u″))/(t − t″) < ε, we say that the dual value does not decrease enough. The value for ε is a parameter of the approach: in experiments we used ε = 0.002.
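One plausible reading of this termination test can be sketched as follows: compare the two best dual values seen so far, and divide the magnitude of the decrease by the number of iterations since the second-best value appeared. The function name and threshold handling are illustrative assumptions, not the paper's code.

```python
def dual_still_improving(dual_values, eps=0.002):
    """Return False when the average per-iteration decrease between the
    second-best and best dual values seen so far falls below eps.
    A sketch of the "dual still improving" check, under the assumption
    that the rate is measured as a positive decrease magnitude."""
    if len(dual_values) < 2:
        return True
    # indices of the best and second-best (smallest) dual values;
    # stable sort picks the earliest iteration on ties
    order = sorted(range(len(dual_values)), key=lambda t: dual_values[t])
    t_best, t_second = order[0], order[1]
    t_now = len(dual_values) - 1
    if t_now == t_second:
        return True  # second-best value just appeared; keep going
    rate = (dual_values[t_second] - dual_values[t_best]) / (t_now - t_second)
    return rate >= eps

print(dual_still_improving([10.0, 8.0, 7.5]))                # -> True
print(dual_still_improving([10.0, 8.0, 7.5, 7.499, 7.498]))  # -> False
```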
See the supplementary material for this submission for an example run of the algorithm.
When C ≠ ∅, A* search can be used for decoding, with the dynamic program for Y′ providing admissible estimates for the dynamic program for Y′_C. Experiments show that A* gives significant improvements in efficiency. The supplementary material contains a full description of the A* algorithm.
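The role of the relaxed dynamic program here is that of an admissible heuristic: its exact scores never overestimate the true remaining cost in the constrained problem, so A* expands the goal only with an optimal path. A generic sketch on a toy graph (the graph and heuristic values are invented for illustration; this is not the paper's decoder):

```python
import heapq

def astar(start, goal, neighbors, heuristic):
    """Generic A* on a weighted graph. With an admissible heuristic
    (one that never overestimates the true cost-to-go), the first time
    the goal is popped, its cost g is optimal."""
    frontier = [(heuristic(start), 0, start)]
    best_g = {start: 0}
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + heuristic(nxt), g2, nxt))
    return None

# toy graph; h plays the part of an exact solution to a relaxed problem,
# and every value is a lower bound on the true cost-to-go (admissible)
graph = {"A": [("B", 2), ("C", 5)], "B": [("C", 1), ("D", 4)],
         "C": [("D", 1)], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}
print(astar("A", "D", lambda n: graph[n], lambda n: h[n]))  # -> 4
```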
6 Experiments
In this section, we present experimental results to demonstrate the efficiency of the decoding algorithm. We compare to MOSES (Koehn et al., 2007), a phrase-based decoder using beam search, and to
Summary
presented dual decomposition as a method for decoding in NLP
formal guarantees
• gives certificate or approximate solution
• can improve approximate solutions by tightening relaxation
efficient algorithms
• uses fast combinatorial algorithms
• can improve speed with lazy decoding
widely applicable
• demonstrated algorithms for a wide range of NLP tasks (parsing, tagging, alignment, MT decoding)
References I
Y. Chang and M. Collins. Exact Decoding of Phrase-based Translation Models through Lagrangian Relaxation. In Proc. EMNLP (to appear), 2011.
J. DeNero and K. Macherey. Model-Based Aligner Combination Using Dual Decomposition. In Proc. ACL, 2011.
M. Held and R.M. Karp. The traveling-salesman problem and minimum spanning trees: Part II. Mathematical Programming, 1:6–25, 1971. URL http://dx.doi.org/10.1007/BF01584070.
D. Klein and C.D. Manning. Factored A* Search for Models over Sequences and Trees. In Proc. IJCAI, volume 18, pages 1246–1251, 2003.
N. Komodakis, N. Paragios, and G. Tziritas. MRF Energy Minimization and Beyond via Dual Decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
References II
T. Koo, A.M. Rush, M. Collins, T. Jaakkola, and D. Sontag. Dual Decomposition for Parsing with Non-Projective Head Automata. In Proc. EMNLP, 2010. URL http://www.aclweb.org/anthology/D10-1125.
B.H. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer Verlag, 2008.
C. Lemarechal. Lagrangian Relaxation. In Computational Combinatorial Optimization: Optimal or Provably Near-Optimal Solutions, pages 112–156. Springer-Verlag, London, UK, 2001.
A. Nedic and A. Ozdaglar. Approximate Primal Solutions and Rate Analysis for Dual Subgradient Methods. SIAM Journal on Optimization, 19(4):1757–1780, 2009.
C. Raphael. Coarse-to-Fine Dynamic Programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:1379–1390, 2001.
References III
A.M. Rush and M. Collins. Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation. In Proc. ACL, 2011.
A.M. Rush, D. Sontag, M. Collins, and T. Jaakkola. On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing. In Proc. EMNLP, 2010.
H.D. Sherali and W.P. Adams. A hierarchy of relaxations and convex hull characterizations for mixed-integer zero–one programming problems. Discrete Applied Mathematics, 52(1):83–106, 1994.
D.A. Smith and J. Eisner. Dependency Parsing by Belief Propagation. In Proc. EMNLP, pages 145–156, 2008. URL http://www.aclweb.org/anthology/D08-1016.
D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP Relaxations for MAP using Message Passing. In Proc. UAI, 2008.