dual decomposition for parsing with non-projective head ...people.csail.mit.edu/srush/talk2.pdf ·...
Dual Decomposition for Parsing with Non-Projective Head Automata
Terry Koo, Alexander M. Rush, Michael Collins, David Sontag, and Tommi Jaakkola
The Cost of Model Complexity
We are always looking for better ways to model natural language.
Tradeoff: Richer models ⇒ Harder decoding
Added complexity is both computational and implementational.
Tasks with challenging decoding problems:
• Speech Recognition
• Sequence Modeling (e.g. extensions to HMM/CRF)
• Parsing
• Machine Translation
Decoding: $y^* = \arg\max_y f(y)$
Non-Projective Dependency Parsing
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
Important problem in many languages.
Problem is NP-Hard for all but the simplest models.
Dual Decomposition
A classical technique for constructing decoding algorithms.
Solve complicated models $y^* = \arg\max_y f(y)$ by decomposing into smaller problems.
Upshot: Can utilize a toolbox of combinatorial algorithms.
• Dynamic programming
• Minimum spanning tree
• Shortest path
• Min-Cut
• ...
A Dual Decomposition Algorithm for Non-Projective Dependency Parsing
Simple - Uses basic combinatorial algorithms
Efficient - Faster than previously proposed algorithms
Strong Guarantees - Gives a certificate of optimality when exact
Solves 98% of examples exactly, even though the problem is NP-Hard
Widely Applicable - Similar techniques extend to other problems
Roadmap
Algorithm
Experiments
Derivation
Non-Projective Dependency Parsing
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
• Starts at the root symbol *
• Each word has exactly one parent word
• Produces a tree structure (no cycles)
• Dependencies can cross
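The four properties above can be checked mechanically. The sketch below is only an illustration: it assumes a parse is stored as a list `heads` where `heads[m]` is the parent of word `m` and index 0 is the root symbol `*`. The function names are hypothetical, and the attachments chosen for `that6`, `he7` and `liked8` in `HEADS` are an assumed parse, not one given on the slides.

```python
def is_valid_tree(heads):
    """True if every word has an in-range parent and reaches the root (no cycles)."""
    n = len(heads)
    # heads[0] belongs to the root symbol and is ignored.
    if any(not (0 <= heads[m] < n) or heads[m] == m for m in range(1, n)):
        return False
    for m in range(1, n):
        node, steps = m, 0
        while node != 0:
            node, steps = heads[node], steps + 1
            if steps >= n:  # a path longer than n - 1 arcs must contain a cycle
                return False
    return True


def has_crossing_arcs(heads):
    """True if two arcs cross, i.e. the parse is non-projective."""
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads) if m > 0]
    return any(a < c < b < d for (a, b) in arcs for (c, d) in arcs)


# *0 John1 saw2 a3 movie4 today5 that6 he7 liked8 (assumed parse)
HEADS = [-1, 2, 0, 4, 2, 2, 8, 8, 4]
print(is_valid_tree(HEADS), has_crossing_arcs(HEADS))
```

Storing one parent per word makes the "exactly one parent" property hold by construction, so only the cycle check and the crossing check need explicit code.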
Algorithm Outline
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
[Figure: a parse under the Arc-Factored Model + a parse under the Sibling Model, combined by Dual Decomposition]
Arc-Factored
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
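$f(y) = \mathrm{score}(\mathrm{head} = *_0, \mathrm{mod} = \mathrm{saw}_2) + \mathrm{score}(\mathrm{saw}_2, \mathrm{John}_1) + \mathrm{score}(\mathrm{saw}_2, \mathrm{movie}_4) + \mathrm{score}(\mathrm{saw}_2, \mathrm{today}_5) + \mathrm{score}(\mathrm{movie}_4, \mathrm{a}_3) + \dots$
e.g. $\mathrm{score}(*_0, \mathrm{saw}_2) = \log p(\mathrm{saw}_2 \mid *_0)$ (generative model)
or $\mathrm{score}(*_0, \mathrm{saw}_2) = w \cdot \phi(\mathrm{saw}_2, *_0)$ (CRF/perceptron model)
$y^* = \arg\max_y f(y)$ ⇐ Minimum Spanning Tree Algorithm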
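To make the arc-factored objective concrete, here is a minimal sketch that sums one score per (head, modifier) arc. The decoder is a brute-force search over every head assignment, used only as a stand-in for the spanning tree algorithm the slide refers to, so it is practical only for very short sentences. `TOY`, `toy_score` and the function names are hypothetical.

```python
from itertools import product


def arc_factored_score(heads, score):
    """Sum score(head, modifier) over all words; heads[0] is the unused root slot."""
    return sum(score(heads[m], m) for m in range(1, len(heads)))


def reaches_root(heads):
    """True if every word reaches the root symbol 0 by following parents."""
    n = len(heads)
    for m in range(1, n):
        node, steps = m, 0
        while node != 0:
            node, steps = heads[node], steps + 1
            if steps >= n:  # cycle (this also catches a word heading itself)
                return False
    return True


def brute_force_decode(n_words, score):
    """Exhaustive argmax over valid trees; NOT the spanning tree algorithm."""
    best_heads, best = None, float("-inf")
    for choice in product(range(n_words + 1), repeat=n_words):
        heads = [-1, *choice]
        if not reaches_root(heads):
            continue
        s = arc_factored_score(heads, score)
        if s > best:
            best_heads, best = heads, s
    return best_heads, best


# Made-up arc scores for a three-word sentence; unlisted arcs score 0.
TOY = {(0, 2): 3.0, (2, 1): 2.0, (2, 3): 2.0}


def toy_score(h, m):
    return TOY.get((h, m), 0.0)


print(brute_force_decode(3, toy_score))
```

Because the score is a sum over independent arcs, the exhaustive search can be replaced by a spanning tree algorithm without changing the result.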
Sibling Models
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
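$f(y) = \mathrm{score}(\mathrm{head} = *_0, \mathrm{prev} = \mathrm{NULL}, \mathrm{mod} = \mathrm{saw}_2) + \mathrm{score}(\mathrm{saw}_2, \mathrm{NULL}, \mathrm{John}_1) + \mathrm{score}(\mathrm{saw}_2, \mathrm{NULL}, \mathrm{movie}_4) + \mathrm{score}(\mathrm{saw}_2, \mathrm{movie}_4, \mathrm{today}_5) + \dots$
e.g. $\mathrm{score}(\mathrm{saw}_2, \mathrm{movie}_4, \mathrm{today}_5) = \log p(\mathrm{today}_5 \mid \mathrm{saw}_2, \mathrm{movie}_4)$
or $\mathrm{score}(\mathrm{saw}_2, \mathrm{movie}_4, \mathrm{today}_5) = w \cdot \phi(\mathrm{saw}_2, \mathrm{movie}_4, \mathrm{today}_5)$
$y^* = \arg\max_y f(y)$ ⇐ NP-Hard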
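The sibling objective can be sketched the same way. The code below assumes, consistent with the terms listed on the slide, that each head's modifiers are read outward from the head on each side, and that `prev` is the previously read modifier on that side (`None` standing in for NULL). The names, the recording scorer and the parse in `HEADS` are illustrative assumptions.

```python
def sibling_score(heads, score):
    """Sum score(head, prev_sibling, modifier) over all words.

    heads[m] is the parent of word m; index 0 is the root symbol *.
    """
    n = len(heads)
    total = 0.0
    for h in range(n):
        mods = [m for m in range(1, n) if heads[m] == h]
        left = sorted((m for m in mods if m < h), reverse=True)  # closest first
        right = sorted(m for m in mods if m > h)                 # closest first
        for side in (left, right):
            prev = None  # NULL: no previous sibling on this side yet
            for m in side:
                total += score(h, prev, m)
                prev = m
    return total


# *0 John1 saw2 a3 movie4 today5 that6 he7 liked8 (assumed parse)
HEADS = [-1, 2, 0, 4, 2, 2, 8, 8, 4]

calls = []


def recording_score(h, prev, m):
    """Toy scorer: records each (head, prev, mod) triple and returns 1."""
    calls.append((h, prev, m))
    return 1.0


total = sibling_score(HEADS, recording_score)
print(total, calls)
```

Each term now depends on two modifiers of the same head, which is what rules out the arc-by-arc spanning tree solution.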
Thought Experiment: Individual Decoding
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
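$\mathrm{score}(\mathrm{saw}_2, \mathrm{NULL}, \mathrm{John}_1) + \mathrm{score}(\mathrm{saw}_2, \mathrm{NULL}, \mathrm{movie}_4) + \mathrm{score}(\mathrm{saw}_2, \mathrm{movie}_4, \mathrm{today}_5)$
$\mathrm{score}(\mathrm{saw}_2, \mathrm{NULL}, \mathrm{John}_1) + \mathrm{score}(\mathrm{saw}_2, \mathrm{NULL}, \mathrm{that}_6)$
$\mathrm{score}(\mathrm{saw}_2, \mathrm{NULL}, \mathrm{a}_3) + \mathrm{score}(\mathrm{saw}_2, \mathrm{a}_3, \mathrm{he}_7)$
$2^{n-1}$ possibilities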
Under Sibling Model, can solve for each word with Viterbi decoding.
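A minimal sketch of that per-head Viterbi decoding follows. It assumes the best modifier set for a head splits into an independent left and right sequence, that an empty sequence scores 0, and that no separate stop score is used (the slides do not show one). All names and the toy scores are hypothetical.

```python
def decode_head_side(h, candidates, score):
    """Best modifier sequence on one side of head h.

    candidates are word positions ordered moving away from h.
    Returns (best_score, chosen_modifiers).
    """
    best = {}  # best[m] = (score of the best sequence ending at m, backpointer)
    for i, m in enumerate(candidates):
        best_s, back = score(h, None, m), None  # m is the first modifier
        for p in candidates[:i]:                # or m follows an earlier sibling p
            s = best[p][0] + score(h, p, m)
            if s > best_s:
                best_s, back = s, p
        best[m] = (best_s, back)
    end, end_s = None, 0.0  # the empty sequence scores 0
    for m in candidates:
        if best[m][0] > end_s:
            end, end_s = m, best[m][0]
    chosen = []
    while end is not None:  # follow backpointers
        chosen.append(end)
        end = best[end][1]
    return end_s, chosen[::-1]


def decode_head(h, n, score):
    """Best modifiers for head h among words 1..n-1 (0 is the root symbol)."""
    left_s, left = decode_head_side(h, list(range(h - 1, 0, -1)), score)
    right_s, right = decode_head_side(h, list(range(h + 1, n)), score)
    return left_s + right_s, sorted(left + right)


# Made-up scores that favour the first candidate shown for saw2.
GOOD = {(2, None, 1): 1.0, (2, None, 4): 1.0, (2, 4, 5): 2.0}


def toy_score(h, prev, m):
    return GOOD.get((h, prev, m), -1.0)


print(decode_head(2, 9, toy_score))
```

The dynamic program considers every subset of candidate modifiers implicitly, but does only a quadratic number of score lookups per side instead of enumerating $2^{n-1}$ sets.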
Thought Experiment Continued
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
Idea: Do individual decoding for each head word using dynamic programming.
If we’re lucky, we’ll end up with a valid final tree.
But we might violate some constraints.
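As an illustration of what "violating constraints" can look like, the sketch below merges per-head modifier sets and checks the tree properties. It assumes the individual decodings are given as a dictionary from head position to its chosen modifiers. The function name, the return format and both example inputs are hypothetical.

```python
def check_combined(mods_by_head, n):
    """Check whether independently decoded modifier sets form a valid tree.

    mods_by_head maps a head position to the modifiers it selected.
    n counts the root symbol, so the words are 1..n-1.
    Returns (is_valid, details).
    """
    parents = {m: [] for m in range(1, n)}
    for h, mods in mods_by_head.items():
        for m in mods:
            parents[m].append(h)
    # Every word must be selected by exactly one head.
    bad = {m: hs for m, hs in parents.items() if len(hs) != 1}
    if bad:
        return False, bad
    # With one parent each, the remaining failure mode is a cycle.
    heads = [-1] + [parents[m][0] for m in range(1, n)]
    for m in range(1, n):
        node, steps = m, 0
        while node != 0:
            node, steps = heads[node], steps + 1
            if steps >= n:
                return False, {"cycle_reached_from": m}
    return True, {}


# Lucky case: the per-head choices happen to agree on a tree.
LUCKY = {0: [2], 2: [1, 4, 5], 4: [3, 8], 8: [6, 7]}
# Unlucky case: he7 also claims that6, so that6 has two parents.
UNLUCKY = {0: [2], 2: [1, 4, 5], 4: [3, 8], 8: [6, 7], 7: [6]}

print(check_combined(LUCKY, 9))
print(check_combined(UNLUCKY, 9))
```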
Dual Decomposition Idea
                 No Constraints         Tree Constraints
Arc-Factored                            Minimum Spanning Tree
Sibling Model    Individual Decoding    Dual Decomposition
Dual Decomposition Structure
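Goal: $y^* = \arg\max_{y \in \mathcal{Y}} f(y)$
Rewrite as: $\arg\max_{z \in \mathcal{Z},\, y \in \mathcal{Y}} f(z) + g(y)$ such that $z = y$
where $\mathcal{Y}$ = Valid Trees, $\mathcal{Z}$ = All Possible, $f$ = Sibling, $g$ = Arc-Factored, and $z = y$ is the Constraint.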
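One standard way to handle the agreement constraint $z = y$ is Lagrangian relaxation. The sketch below uses assumed notation that does not appear on this slide: $u(i,j)$ is a penalty weight for the arc from head $i$ to modifier $j$, and $z(i,j)$, $y(i,j)$ are 0/1 indicators that the arc is present in $z$ or $y$.

```latex
\begin{align*}
L(u) &= \max_{z \in \mathcal{Z}} \Big( f(z) + \sum_{i,j} u(i,j)\, z(i,j) \Big)
      + \max_{y \in \mathcal{Y}} \Big( g(y) - \sum_{i,j} u(i,j)\, y(i,j) \Big) \\
L(u) &\ge \max_{z \in \mathcal{Z},\, y \in \mathcal{Y},\, z = y} f(z) + g(y)
      \quad \text{for every } u
\end{align*}
```

The bound holds because the penalty terms cancel whenever $z = y$. Each maximization involves only one model, so the first can be solved by individual decoding and the second by a spanning tree algorithm. If the two maximizers agree for some $u$, the bound is tight and that solution is optimal.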
![Page 57: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/57.jpg)
Algorithm Sketch
Set penalty weights equal to 0 for all edges.
For k = 1 to K:
    $z^{(k)}$ ← Decode($f(z)$ + penalty) by Individual Decoding
    $y^{(k)}$ ← Decode($g(y)$ − penalty) by Minimum Spanning Tree
    If $y^{(k)}(i,j) = z^{(k)}(i,j)$ for all $i, j$: Return $(y^{(k)}, z^{(k)})$
    Else: Update penalty weights based on $y^{(k)}(i,j) - z^{(k)}(i,j)$
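The loop above can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation: the two decoders are brute-force searches over tiny hand-written score tables (`F_SCORES`, `G_SCORES`), the scores are made up, and the fixed step size of 1 is an assumption. A real system would use head-automaton decoding and a maximum spanning tree algorithm instead.

```python
# Toy sketch of the dual decomposition loop; all names and scores are hypothetical.
# A structure is a frozenset of arcs (head, modifier) over words 1..2 plus root 0.

A = frozenset({(0, 1), (0, 2)})
B = frozenset({(0, 1), (1, 2)})
C = frozenset({(2, 1), (0, 2)})
D = frozenset({(2, 1), (1, 2)})  # contains a cycle, so it is not a tree

F_SCORES = {A: 2.0, B: 4.0, C: 3.0, D: 5.0}  # f over Z (no tree constraint)
G_SCORES = {A: 1.0, B: 3.0, C: 0.0}          # g over Y (valid trees only)


def decode(scores, u, sign):
    """Brute-force argmax of score(s) + sign * sum of u(i, j) over arcs in s."""
    def penalized(s):
        return scores[s] + sign * sum(u.get(arc, 0.0) for arc in s)
    return max(scores, key=penalized)


def dual_decompose(f_scores, g_scores, max_iters=50, step=1.0):
    u = {}  # penalty weights u(i, j); a missing key means 0
    y = None
    for k in range(1, max_iters + 1):
        z = decode(f_scores, u, +1.0)  # stands in for Individual Decoding
        y = decode(g_scores, u, -1.0)  # stands in for Minimum Spanning Tree
        if y == z:
            return y, k, True          # agreement: stop early
        for arc in z - y:              # arcs only z chose: lower u
            u[arc] = u.get(arc, 0.0) - step
        for arc in y - z:              # arcs only y chose: raise u
            u[arc] = u.get(arc, 0.0) + step
    return y, max_iters, False         # no agreement: return the last tree


result, iterations, agreed = dual_decompose(F_SCORES, G_SCORES)
```

On this toy input the unpenalized sibling-side decoder first picks the cyclic structure `D`, and one penalty update is enough to make both decoders agree on `B`.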
![Page 58: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/58.jpg)
Individual Decoding
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
$z^* = \arg\max_{z \in \mathcal{Z}} \Big( f(z) + \sum_{i,j} u(i,j)\, z(i,j) \Big)$
Minimum Spanning Tree
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
$y^* = \arg\max_{y \in \mathcal{Y}} \Big( g(y) - \sum_{i,j} u(i,j)\, y(i,j) \Big)$
Penalties: $u(i,j) = 0$ for all $i, j$
Iteration 1: u(8, 1) = -1, u(4, 6) = -1, u(2, 6) = 1, u(8, 7) = 1
Iteration 2: u(8, 1) = -1, u(4, 6) = -2, u(2, 6) = 2, u(8, 7) = 1
Converged: $y^* = \arg\max_{y \in \mathcal{Y}} f(y) + g(y)$
Key:
$f(z)$ ⇐ Sibling Model; $g(y)$ ⇐ Arc-Factored Model
$\mathcal{Z}$ ⇐ No Constraints; $\mathcal{Y}$ ⇐ Tree Constraints
$y(i,j) = 1$ if $y$ contains dependency $i, j$
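The penalty values listed for iterations 1 and 2 can be reproduced with the update $u' = u - \alpha g_u$, $g_u(i,j) = z^*(i,j) - y^*(i,j)$, given on the subgradient slide. The sketch below assumes a step size of 1. The arc sets are hypothetical: they list only the arcs on which the two parses are assumed to disagree, chosen so that the update matches the listed values, since the transcript does not give the full parses.

```python
def update_penalties(u, z_arcs, y_arcs, step=1):
    """Return a new dict with u(i, j) - step * (z(i, j) - y(i, j)) for every arc."""
    new_u = dict(u)
    for arc in z_arcs - y_arcs:      # chosen by z only: penalty goes down
        new_u[arc] = new_u.get(arc, 0) - step
    for arc in y_arcs - z_arcs:      # chosen by y only: penalty goes up
        new_u[arc] = new_u.get(arc, 0) + step
    return new_u


u0 = {}  # all penalties start at 0

# Iteration 1 (assumed disagreement): z uses arcs (8, 1) and (4, 6),
# while y uses arcs (2, 6) and (8, 7).
u1 = update_penalties(u0, z_arcs={(8, 1), (4, 6)}, y_arcs={(2, 6), (8, 7)})

# Iteration 2 (assumed disagreement): only word 6 is still contested.
u2 = update_penalties(u1, z_arcs={(4, 6)}, y_arcs={(2, 6)})
```

Arcs that both parses contain cancel in $z - y$ and leave their penalty unchanged, which is why only the contested arcs appear in the tables.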
![Page 70: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/70.jpg)
Guarantees
Theorem: If at any iteration $y^{(k)} = z^{(k)}$, then $(y^{(k)}, z^{(k)})$ is the global optimum.
In experiments, we find the global optimum on 98% of examples.
If we do not converge to a match, we can still return an approximate solution (more in the paper).
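A short argument for the theorem, written as a sketch in the notation of the "Deriving the Algorithm" slide. There $L(u)$ is the dual objective, which upper-bounds the constrained optimum $f(z^*) + g(y^*)$ for any $u$. Here $u$ denotes the penalties in force at iteration $k$, so $z^{(k)}$ and $y^{(k)}$ are the maximizers of the two penalized subproblems:

$$
\begin{aligned}
L(u) &= \Big( f(z^{(k)}) + \sum_{i,j} u(i,j)\, z^{(k)}(i,j) \Big) + \Big( g(y^{(k)}) - \sum_{i,j} u(i,j)\, y^{(k)}(i,j) \Big) \\
     &= f(z^{(k)}) + g(y^{(k)}) \qquad \text{(the penalty terms cancel because } z^{(k)} = y^{(k)}\text{)} \\
     &\le f(z^*) + g(y^*) \qquad \text{(the pair is feasible for the constrained problem)} \\
     &\le L(u).
\end{aligned}
$$

The chain starts and ends with $L(u)$, so every inequality is an equality and $f(z^{(k)}) + g(y^{(k)}) = f(z^*) + g(y^*)$.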
![Page 72: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/72.jpg)
Extensions
- Grandparent Models
  *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
  f(y) = ... + score(gp = *0, head = saw2, prev = movie4, mod = today5)
- Head Automata (Eisner, 2000)
  Generalization of Sibling models
  Allow arbitrary automata as local scoring function.
![Page 73: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/73.jpg)
Roadmap
Algorithm
Experiments
Derivation
![Page 74: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/74.jpg)
Experiments
Properties:
- Exactness
- Parsing Speed
- Parsing Accuracy
- Comparison to Individual Decoding
- Comparison to LP/ILP
Training:
- Averaged Perceptron (more details in paper)
Experiments on:
- CoNLL Datasets
- English Penn Treebank
- Czech Dependency Treebank
![Page 75: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/75.jpg)
How often do we exactly solve the problem?
[Figure: bar chart, y-axis from 90 to 100; one bar each for Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur.]
- Percentage of examples where the dual decomposition finds an exact solution.
![Page 76: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/76.jpg)
Parsing Speed
[Figure: two bar charts, each with one bar for Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur. Panel "Sibling model": y-axis from 0 to 50. Panel "Grandparent model": y-axis from 0 to 25.]
- Number of sentences parsed per second
- Comparable to dynamic programming for projective parsing
![Page 77: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/77.jpg)
Accuracy
| | Arc-Factored | Prev Best | Grandparent |
|---|---|---|---|
| Dan | 89.7 | 91.5 | 91.8 |
| Dut | 82.3 | 85.6 | 85.8 |
| Por | 90.7 | 92.1 | 93.0 |
| Slo | 82.4 | 85.6 | 86.2 |
| Swe | 88.9 | 90.6 | 91.4 |
| Tur | 75.7 | 76.4 | 77.6 |
| Eng | 90.1 | - | 92.5 |
| Cze | 84.4 | - | 87.3 |

Prev Best: best reported results for the CoNLL-X data set, which include
- Approximate search (McDonald and Pereira, 2006)
- Loopy belief propagation (Smith and Eisner, 2008)
- (Integer) Linear Programming (Martins et al., 2009)
![Page 78: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/78.jpg)
Comparison to Subproblems
[Figure: bar chart for Eng, y-axis from 88 to 93; bars: Individual, MST, Dual.]
F1 for dependency accuracy
![Page 79: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/79.jpg)
Comparison to LP/ILP
Martins et al. (2009) propose two representations of non-projective dependency parsing as a linear programming relaxation, as well as an exact ILP.
- LP (1)
- LP (2)
- ILP
Use an LP/ILP Solver for decoding
We compare:
- Accuracy
- Exactness
- Speed
Both LP and dual decomposition methods use the same model, features, and weights w.
![Page 80: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/80.jpg)
Comparison to LP/ILP: Accuracy
[Figure: bar chart of dependency accuracy, y-axis from 80 to 100; bars: LP(1), LP(2), ILP, Dual.]
- All decoding methods have comparable accuracy
![Page 81: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/81.jpg)
Comparison to LP/ILP: Exactness and Speed
[Figure: two bar charts with bars LP(1), LP(2), ILP, Dual. Panel "Percentage with exact solution": y-axis from 80 to 100. Panel "Sentences per second": y-axis from 0 to 14.]
![Page 82: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/82.jpg)
Roadmap
Algorithm
Experiments
Derivation
![Page 83: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/83.jpg)
Deriving the Algorithm
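Goal: $y^* = \arg\max_{y \in \mathcal{Y}} f(y)$
Rewrite: $\arg\max_{z \in \mathcal{Z},\, y \in \mathcal{Y}} f(z) + g(y)$ s.t. $z(i,j) = y(i,j)$ for all $i, j$
Lagrangian: $L(u, y, z) = f(z) + g(y) + \sum_{i,j} u(i,j)\, \big( z(i,j) - y(i,j) \big)$
The dual problem is to find $\min_u L(u)$ where
$$L(u) = \max_{y \in \mathcal{Y},\, z \in \mathcal{Z}} L(u, y, z) = \max_{z \in \mathcal{Z}} \Big( f(z) + \sum_{i,j} u(i,j)\, z(i,j) \Big) + \max_{y \in \mathcal{Y}} \Big( g(y) - \sum_{i,j} u(i,j)\, y(i,j) \Big)$$
Dual is an upper bound: $L(u) \ge f(z^*) + g(y^*)$ for any $u$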
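The upper-bound property can be checked numerically on a tiny example. The sketch below is an illustration under stated assumptions, not part of the original talk: the structures and score tables `F` and `G` are made up, $\mathcal{Z}$ is the four possible head assignments for a two-word sentence, and $\mathcal{Y}$ is the three of them that are trees. It evaluates $L(u)$ by brute force and compares it with the constrained optimum.

```python
import random

# Hypothetical toy problem: a structure is a frozenset of arcs (head, modifier).
F = {  # f(z) over Z: every head assignment, including the cyclic one
    frozenset({(0, 1), (0, 2)}): 2.0,
    frozenset({(0, 1), (1, 2)}): 4.0,
    frozenset({(2, 1), (0, 2)}): 3.0,
    frozenset({(2, 1), (1, 2)}): 5.0,
}
G = {  # g(y) over Y: valid trees only
    frozenset({(0, 1), (0, 2)}): 1.0,
    frozenset({(0, 1), (1, 2)}): 3.0,
    frozenset({(2, 1), (0, 2)}): 0.0,
}


def dot(u, structure):
    """Sum of u(i, j) over the arcs of a structure; missing arcs count as 0."""
    return sum(u.get(arc, 0.0) for arc in structure)


def dual(u):
    """L(u): the two penalized subproblems are maximized independently."""
    best_z = max(F[z] + dot(u, z) for z in F)
    best_y = max(G[y] - dot(u, y) for y in G)
    return best_z + best_y


# Constrained optimum: z = y forces a single tree scored by f + g.
primal = max(F[y] + G[y] for y in G)

random.seed(0)
arcs = sorted(set().union(*F))
samples = [{arc: random.uniform(-3.0, 3.0) for arc in arcs} for _ in range(1000)]
bound_holds = all(dual(u) >= primal - 1e-9 for u in samples)

gap_at_zero = dual({}) - primal                       # loose bound at u = 0
tight_value = dual({(2, 1): -1.0, (0, 1): 1.0})       # bound meets the optimum
```

On this toy problem the bound is loose at $u = 0$ and becomes tight at the second penalty setting, which is the situation the algorithm searches for.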
![Page 85: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/85.jpg)
A Subgradient Algorithm for Minimizing L(u)
$$L(u) = \max_{z \in \mathcal{Z}} \Big( f(z) + \sum_{i,j} u(i,j)\, z(i,j) \Big) + \max_{y \in \mathcal{Y}} \Big( g(y) - \sum_{i,j} u(i,j)\, y(i,j) \Big)$$
$L(u)$ is convex, but not differentiable. A subgradient of $L(u)$ at $u$ is a vector $g_u$ such that for all $v$,
$$L(v) \ge L(u) + g_u \cdot (v - u)$$
Subgradient methods use updates $u' = u - \alpha g_u$
In fact, for our $L(u)$, $g_u(i,j) = z^*(i,j) - y^*(i,j)$
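Why $z^* - y^*$ is a subgradient can be verified directly. In the sketch below, $z^*$ and $y^*$ are taken to be the maximizers of the two subproblems at the current $u$ (the slide does not spell this out, but it is the reading consistent with the decoding steps). For any $v$:

$$
\begin{aligned}
L(v) &= \max_{z \in \mathcal{Z}} \Big( f(z) + \sum_{i,j} v(i,j)\, z(i,j) \Big) + \max_{y \in \mathcal{Y}} \Big( g(y) - \sum_{i,j} v(i,j)\, y(i,j) \Big) \\
     &\ge f(z^*) + \sum_{i,j} v(i,j)\, z^*(i,j) + g(y^*) - \sum_{i,j} v(i,j)\, y^*(i,j) \\
     &= L(u) + \sum_{i,j} \big( v(i,j) - u(i,j) \big) \big( z^*(i,j) - y^*(i,j) \big).
\end{aligned}
$$

The last line is the subgradient inequality with $g_u = z^* - y^*$, so the update $u' = u - \alpha g_u$ lowers $u(i,j)$ on arcs chosen only by $z^*$ and raises it on arcs chosen only by $y^*$.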
![Page 86: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/86.jpg)
Related Work
- Methods that use general purpose linear programming or integer linear programming solvers (Martins et al. 2009; Riedel and Clarke 2006; Roth and Yih 2005)
- Dual decomposition/Lagrangian relaxation in combinatorial optimization (Dantzig and Wolfe, 1960; Held and Karp, 1970; Fisher 1981)
- Dual decomposition for inference in MRFs (Komodakis et al., 2007; Wainwright et al., 2005)
- Methods that incorporate combinatorial solvers within loopy belief propagation (Duchi et al. 2007; Smith and Eisner 2008)
![Page 88: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/88.jpg)
Summary
$y^* = \arg\max_y f(y)$ ⇐ NP-Hard
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
Arc-Factored Model + Sibling Model, combined by Dual Decomposition
![Page 89: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/89.jpg)
Other Applications
- Dual decomposition can be applied to other decoding problems.
- Rush et al. (2010) focuses on integrated dynamic programming algorithms.
- Integrated Parsing and Tagging
- Integrated Constituency and Dependency Parsing
![Page 91: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/91.jpg)
Parsing and Tagging
$y^* = \arg\max_y f(y)$ ⇐ Slow
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
S N V D N A D N V
+
[Diagram: constituency tree over the same sentence. Node labels in extraction order: S, NP, N (John), VP, V (saw), NP, D (a), NP, N (movie), ADVP, ADV (today), ADVP, D (that), VP, N (he), V (liked).]
HMM Model + CFG Model, combined by Dual Decomposition
![Page 93: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/93.jpg)
Dependency and Constituency
$y^* = \arg\max_y f(y)$ ⇐ Slow
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
+
[Diagram: constituency tree over the same sentence. Node labels in extraction order: S, NP, N (John), VP, V (saw), NP, D (a), NP, N (movie), ADVP, ADV (today), ADVP, D (that), VP, N (he), V (liked).]
Dependency Model + Lexicalized CFG, combined by Dual Decomposition
![Page 94: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/94.jpg)
Future Directions
There is much more to explore around dual decomposition in NLP.
- Known Techniques
  - Generalization to more than two models
  - K-best decoding
  - Approximate subgradient
  - Heuristic for branch-and-bound type search
- Possible NLP Applications
  - Machine Translation
  - Speech Recognition
  - "Loopy" Sequence Models
- Open Questions
  - Can we speed up subalgorithms when running repeatedly?
  - What are the trade-offs of different decompositions?
  - Are there better methods for optimizing the dual?
![Page 95: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/95.jpg)
Appendix
![Page 96: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/96.jpg)
Training the Model
*0 John1 saw2 a3 movie4 today5 that6 he7 liked8
f (y) = ... + score(saw2,movie4, today5) + ...
- score(saw2, movie4, today5) = w · φ(saw2, movie4, today5)
- Weight vector w trained using averaged perceptron.
- (More details in the paper.)
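The linear score $w \cdot \phi$ can be illustrated with sparse feature dictionaries. This is a minimal sketch with assumptions: the feature templates in `phi` and the weights in `w` are hypothetical and are not the feature set used in the paper. The argument order (head, previous sibling, modifier) follows the labelling used on the Extensions slide.

```python
def phi(head, prev_sibling, modifier):
    """Hypothetical sparse feature map for one sibling part."""
    return {
        f"head={head}|mod={modifier}": 1.0,
        f"sib={prev_sibling}|mod={modifier}": 1.0,
        f"head={head}|sib={prev_sibling}|mod={modifier}": 1.0,
    }


def score(w, head, prev_sibling, modifier):
    """Dot product w . phi(head, prev_sibling, modifier) over sparse features."""
    features = phi(head, prev_sibling, modifier)
    return sum(w.get(name, 0.0) * value for name, value in features.items())


# Made-up weights; features missing from w contribute 0.
w = {"head=saw|mod=today": 0.5, "sib=movie|mod=today": 0.25}
s = score(w, "saw", "movie", "today")
```

The model score f(y) is then a sum of such part scores over all sibling parts in y.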
![Page 97: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/97.jpg)
Early Stopping
[Figure: percentage (y-axis, 50 to 100) against maximum number of dual decomposition iterations (x-axis, 0 to 1000); curves: % validation UAS, % certificates, % match K=5000.]
![Page 98: Dual Decomposition for Parsing with Non-Projective Head ...people.csail.mit.edu/srush/talk2.pdf · y = arg max y f (y)Decoding. The Cost of Model Complexity We are always looking](https://reader034.vdocument.in/reader034/viewer/2022043017/5f3953a91f6b5964b70ff505/html5/thumbnails/98.jpg)
Caching
[Figure: % of head automata recomputed (y-axis, 0 to 30) against iterations of dual decomposition (x-axis, 0 to 5000); curves: % recomputed, g+s; % recomputed, sib.]
Caching speed