Structured Prediction Models via the Matrix-Tree Theorem

TRANSCRIPT of slides: people.csail.mit.edu/maestro/papers/koo07emnlp-slides.pdf
Structured Prediction Models via the Matrix-Tree Theorem
Terry Koo [email protected]
Amir Globerson [email protected]
Xavier Carreras [email protected]
Michael Collins [email protected]
MIT Computer Science and Artificial Intelligence Laboratory
Dependency parsing
* John saw Mary
Syntactic structure represented by head-modifier dependencies
Projective vs. non-projective structures
* John saw a movie today that he liked
Non-projective structures allow crossing dependencies
Frequent in languages like Czech, Dutch, etc.
Non-projective parsing is max-spanning-tree (McDonald et al., 2005)
Contributions of this work
Fundamental inference algorithms that sum over possible structures:

| Model type | Inference Algorithm |
| --- | --- |
| HMM | Forward-Backward |
| Graphical Model | Belief Propagation |
| PCFG | Inside-Outside |
| Projective Dep. Trees | Inside-Outside |
| Non-projective Dep. Trees | ?? |
This talk:
Inside-outside-style algorithms for non-projective dependency structures
An application: training log-linear and max-margin parsers
Independently-developed work: Smith and Smith (2007), McDonald and Satta (2007)
Overview
Background
Matrix-Tree-based inference
Experiments
Edge-factored structured prediction
*(0) John(1) saw(2) Mary(3)
A dependency tree y is a set of head-modifier dependencies (McDonald et al., 2005; Eisner, 1996)
(h,m) is a dependency with feature vector f(x, h, m)
Y(x) is the set of all possible trees for sentence x
y* = argmax_{y ∈ Y(x)} Σ_{(h,m) ∈ y} w · f(x, h, m)
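As a concrete (if exponential-time) reference, the argmax can be found on tiny sentences by enumerating all head assignments and keeping the well-formed trees. This is an illustrative Python sketch, not the MST decoder of McDonald et al.; the function names are made up:

```python
import itertools

def is_tree(heads):
    # heads[m-1] is the head of word m, for m = 1..n; node 0 is the root *
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False  # a cycle that never reaches the root
            seen.add(node)
            node = heads[node - 1]
    return True

def argmax_tree(score):
    # score[h][m] is the edge score w . f(x, h, m); score is (n+1) x (n+1)
    n = len(score) - 1
    best, best_heads = float("-inf"), None
    for heads in itertools.product(range(n + 1), repeat=n):
        if any(heads[i] == i + 1 for i in range(n)) or not is_tree(heads):
            continue  # skip self-loops and non-trees
        s = sum(score[heads[m - 1]][m] for m in range(1, n + 1))
        if s > best:
            best, best_heads = s, heads
    return best_heads, best
```

For "* John saw Mary", scoring the edges 0→2, 2→1, 2→3 highly recovers the tree rooted at "saw".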
Training log-linear dependency parsers
Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize

L(w) = (C/2)||w||² − Σ_{i=1}^{N} log P(y_i | x_i; w)
Training log-linear dependency parsers
Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize

L(w) = (C/2)||w||² − Σ_{i=1}^{N} log P(y_i | x_i; w)

Log-linear distribution over trees

P(y | x; w) = (1/Z(x; w)) exp{ Σ_{(h,m) ∈ y} w · f(x, h, m) }

Z(x; w) = Σ_{y ∈ Y(x)} exp{ Σ_{(h,m) ∈ y} w · f(x, h, m) }
Training log-linear dependency parsers
Gradient-based optimizers evaluate L(w) and ∂L/∂w
L(w) = (C/2)||w||² − Σ_{i=1}^{N} Σ_{(h,m) ∈ y_i} w · f(x_i, h, m) + Σ_{i=1}^{N} log Z(x_i; w)
Main difficulty: computation of the partition functions
Training log-linear dependency parsers
Gradient-based optimizers evaluate L(w) and ∂L/∂w
∂L/∂w = Cw − Σ_{i=1}^{N} Σ_{(h,m) ∈ y_i} f(x_i, h, m) + Σ_{i=1}^{N} Σ_{h′,m′} P(h′ → m′ | x_i; w) f(x_i, h′, m′)
The marginals are edge-appearance probabilities
P(h → m | x; w) = Σ_{y ∈ Y(x) : (h,m) ∈ y} P(y | x; w)
Generalized log-linear inference
Vector θ with parameter θh,m for each dependency
P(y | x; θ) = (1/Z(x; θ)) exp{ Σ_{(h,m) ∈ y} θ_{h,m} }

Z(x; θ) = Σ_{y ∈ Y(x)} exp{ Σ_{(h,m) ∈ y} θ_{h,m} }

P(h → m | x; θ) = (1/Z(x; θ)) Σ_{y ∈ Y(x) : (h,m) ∈ y} exp{ Σ_{(h′,m′) ∈ y} θ_{h′,m′} }
E.g., θ_{h,m} = w · f(x, h, m)
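For tiny sentences, Z(x; θ) and the marginals can be computed exactly by enumerating Y(x) directly, which makes a handy reference implementation for testing the faster matrix-based algorithms. A Python sketch (function names are made up):

```python
import itertools
import math

def is_tree(heads):
    # heads[m-1] is the head of word m, for m = 1..n; node 0 is the root *
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False  # a cycle that never reaches the root
            seen.add(node)
            node = heads[node - 1]
    return True

def partition_and_marginals(theta):
    # theta[h][m] is the score of dependency h -> m; theta is (n+1) x (n+1)
    n = len(theta) - 1
    Z, marg = 0.0, {}
    for heads in itertools.product(range(n + 1), repeat=n):
        if any(heads[i] == i + 1 for i in range(n)) or not is_tree(heads):
            continue
        # weight of this tree: exp of the summed edge scores
        w = math.exp(sum(theta[heads[m - 1]][m] for m in range(1, n + 1)))
        Z += w
        for m in range(1, n + 1):
            e = (heads[m - 1], m)
            marg[e] = marg.get(e, 0.0) + w
    return Z, {e: v / Z for e, v in marg.items()}
```

With all scores zero and three words, Z counts the 16 (multi-root) spanning trees of the 4-node complete graph, and each modifier's head marginals sum to one.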
Applications of log-linear inference
Generalized inference engine that takes θ as input
Different definitions of θ can be used for log-linear or max-margin training
w*_LL = argmin_w [ (C/2)||w||² − Σ_{i=1}^{N} log P(y_i | x_i; w) ]

w*_MM = argmin_w [ (C/2)||w||² + Σ_{i=1}^{N} max_y (E_{i,y} − m_{i,y}(w)) ]
Exponentiated-gradient updates for max-margin models
Bartlett, Collins, Taskar and McAllester (2004)
Globerson, Koo, Carreras and Collins (2007)
Overview
Background
Matrix-Tree-based inference
Experiments
Single-root vs. multi-root structures
(Two example trees over "* John saw Mary": one multi-root, one single-root.)
Multi-root structures allow multiple edges from *
Single-root structures have exactly one edge from *
Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007)
Matrix-Tree Theorem (Tutte, 1984)
Given:
1. Directed graph G
2. Edge weights θ
3. A node r in G
(Diagram: a small directed graph with weighted edges; two spanning trees rooted at node 1 have weights 2 + 4 and 1 + 3.)
A matrix L(r) can be constructed whose determinant is the sum of weighted spanning trees of G rooted at r
E.g., for the example graph, the sum over spanning trees rooted at node 1 is exp{2 + 4} + exp{1 + 3} = det(L(1))
Multi-root partition function
*(0) John(1) saw(2) Mary(3)
Edge weights θ, root r = 0
det(L(0)) = non-projective multi-root partition function
Construction of L(0)
L(0) has a simple construction
off-diagonal: L(0)_{h,m} = −exp{θ_{h,m}}

on-diagonal: L(0)_{m,m} = Σ_{h′=0}^{n} exp{θ_{h′,m}}

E.g., L(0)_{3,3} sums the weights of all candidate heads of Mary(3):

*(0) John(1) saw(2) Mary(3)
The determinant of L(0) can be evaluated in O(n3) time
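A NumPy sketch of this construction, assuming θ is given as an (n+1)×(n+1) score array with row/column 0 for the root * (the function name is made up):

```python
import numpy as np

def multi_root_partition(theta):
    # theta[h, m] scores dependency h -> m; node 0 is the root symbol *
    # (the diagonal and column 0 of theta are never used)
    n = theta.shape[0] - 1
    A = np.exp(theta)
    L = np.zeros((n, n))  # L(0), indexed by modifiers 1..n
    for m in range(1, n + 1):
        for h in range(1, n + 1):
            if h != m:
                L[h - 1, m - 1] = -A[h, m]  # off-diagonal: -exp(theta[h, m])
        # diagonal: total incoming weight of modifier m, root included
        L[m - 1, m - 1] = sum(A[h, m] for h in range(n + 1) if h != m)
    return np.linalg.det(L)  # multi-root partition function, O(n^3)
```

With all scores zero, the determinant counts spanning trees: 3 for a 2-word sentence and 16 for a 3-word sentence, matching Cayley's formula (n+1)^{n-1}.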
Single-root vs. multi-root structures
(Two example trees over "* John saw Mary": one multi-root, one single-root.)
Multi-root structures allow multiple edges from *
Single-root structures have exactly one edge from *
Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007)
Single-root partition function

Naïve method for computing the single-root non-projective partition function:

*(0) John(1) saw(2) Mary(3)

For each c = 1, …, n, exclude all root edges except (0, c) and take the determinant; summing the n determinants gives the single-root partition function.

Computing n determinants requires O(n⁴) time
Single-root partition function
An alternate matrix L can be constructed such that det(L) is the single-root partition function

first row: L_{1,m} = exp{θ_{0,m}}

other rows, on-diagonal: L_{m,m} = Σ_{h′=1}^{n} exp{θ_{h′,m}}

other rows, off-diagonal: L_{h,m} = −exp{θ_{h,m}}
Single-root partition function requires O(n3) time
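A NumPy sketch of both routes (function names are made up): the O(n³) single-root matrix above, and the naïve O(n⁴) method that sums one multi-root determinant per choice of the root's child.

```python
import numpy as np

def single_root_partition(theta):
    # theta[h, m] scores dependency h -> m; node 0 is the root symbol *
    n = theta.shape[0] - 1
    A = np.exp(theta)
    L = np.zeros((n, n))
    for m in range(1, n + 1):
        L[0, m - 1] = A[0, m]  # first row: root-edge weights
        if m >= 2:
            # diagonal: incoming weight of m from non-root heads only
            L[m - 1, m - 1] = sum(A[h, m] for h in range(1, n + 1) if h != m)
        for h in range(2, n + 1):
            if h != m:
                L[h - 1, m - 1] = -A[h, m]
    return np.linalg.det(L)  # single-root partition function, O(n^3)

def naive_single_root(theta):
    # O(n^4): one multi-root determinant per choice c of the root's child
    n = theta.shape[0] - 1
    total = 0.0
    for c in range(1, n + 1):
        A = np.exp(theta)
        keep = A[0, c]
        A[0, :] = 0.0
        A[0, c] = keep  # exclude all root edges except (0, c)
        L = np.zeros((n, n))
        for m in range(1, n + 1):
            for h in range(1, n + 1):
                if h != m:
                    L[h - 1, m - 1] = -A[h, m]
            L[m - 1, m - 1] = sum(A[h, m] for h in range(n + 1) if h != m)
        total += np.linalg.det(L)
    return total
```

With all scores zero and three words, both routes give the 9 single-root trees, and they agree on arbitrary score matrices.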
Non-projective marginals
The log-partition function generates the marginals

P(h → m | x; θ) = ∂ log Z(x; θ) / ∂θ_{h,m} = ∂ log det(L) / ∂θ_{h,m} = Σ_{h′,m′} [∂ log det(L) / ∂L_{h′,m′}] · [∂L_{h′,m′} / ∂θ_{h,m}]

Derivative of the log-determinant: ∂ log det(L) / ∂L = (L⁻¹)ᵀ
Complexity dominated by matrix inverse, O(n3)
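For the multi-root matrix L(0), only two entries depend on a given θ_{h,m} (the column-m diagonal and, for non-root heads, the (h,m) off-diagonal), so the chain rule above yields each marginal directly from one matrix inverse. A NumPy sketch (the function name is made up):

```python
import numpy as np

def multi_root_marginals(theta):
    # theta[h, m] scores dependency h -> m; node 0 is the root symbol *
    n = theta.shape[0] - 1
    A = np.exp(theta)
    L = np.zeros((n, n))  # the multi-root matrix L(0)
    for m in range(1, n + 1):
        for h in range(1, n + 1):
            if h != m:
                L[h - 1, m - 1] = -A[h, m]
        L[m - 1, m - 1] = sum(A[h, m] for h in range(n + 1) if h != m)
    Linv = np.linalg.inv(L)       # d log det(L) / dL = (L^-1)^T
    P = np.zeros((n + 1, n + 1))  # P[h, m] = P(h -> m | x; theta)
    for m in range(1, n + 1):
        # a root edge (0, m) enters only the diagonal entry L[m-1, m-1]
        P[0, m] = A[0, m] * Linv[m - 1, m - 1]
        for h in range(1, n + 1):
            if h != m:
                # theta[h, m] enters L[m-1, m-1] (+) and L[h-1, m-1] (-)
                P[h, m] = A[h, m] * (Linv[m - 1, m - 1] - Linv[m - 1, h - 1])
    return P
```

Each modifier's head marginals sum to one, which is a useful sanity check: with uniform scores and three words, P(0 → m) = 1/2 and P(h → m) = 1/4 for the non-root heads.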
Summary of non-projective inference
Partition function: matrix determinant, O(n3)
Marginals: matrix inverse, O(n3)
Single-root inference: L
Multi-root inference: L(0)
Overview
Background
Matrix-Tree-based inference
Experiments
Log-linear and max-margin training
Log-linear training

w*_LL = argmin_w [ (C/2)||w||² − Σ_{i=1}^{N} log P(y_i | x_i; w) ]

Max-margin training

w*_MM = argmin_w [ (C/2)||w||² + Σ_{i=1}^{N} max_y (E_{i,y} − m_{i,y}(w)) ]
Multilingual parsing experiments
Six languages from CoNLL 2006 shared task
Training algorithms: averaged perceptron, log-linear models, max-margin models
Projective models vs. non-projective models
Single-root models vs. multi-root models
Multilingual parsing experiments
Dutch (4.93% cd)

| | Projective Training | Non-Projective Training |
| --- | --- | --- |
| Perceptron | 77.17 | 78.83 |
| Log-Linear | 76.23 | 79.55 |
| Max-Margin | 76.53 | 79.69 |
Non-projective training helps on non-projective languages
Multilingual parsing experiments
Spanish (0.06% cd)

| | Projective Training | Non-Projective Training |
| --- | --- | --- |
| Perceptron | 81.19 | 80.02 |
| Log-Linear | 81.75 | 81.57 |
| Max-Margin | 81.71 | 81.93 |
Non-projective training doesn’t hurt on projectivelanguages
Multilingual parsing experiments
Results across all 6 languages (Arabic, Dutch, Japanese, Slovene, Spanish, Turkish)

| Perceptron | 79.05 |
| --- | --- |
| Log-Linear | 79.71 |
| Max-Margin | 79.82 |
Log-linear and max-margin parsers show improvement over perceptron-trained parsers
Improvements are statistically significant (sign test)
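For reference, a sentence-level sign test can be sketched as follows: count the sentences where one parser beats the other and compare against a fair coin. This is illustrative only; the slide does not specify the exact test setup.

```python
from math import comb

def sign_test_p(wins, losses):
    # Two-sided sign test: under H0, each non-tied sentence-level
    # comparison between the two parsers is a fair coin flip.
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

E.g., winning on 9 of 10 non-tied sentences gives a two-sided p-value of 22/1024 ≈ 0.021.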
Summary
Inside-outside-style inference algorithms for non-projective structures
Application of the Matrix-Tree Theorem
Inference for both multi-root and single-root structures
Empirical results
Non-projective training is good for non-projective languages
Log-linear and max-margin parsers outperform perceptron parsers
Thanks for listening!
Challenges for future research
State-of-the-art performance is obtained by higher-order models (McDonald and Pereira, 2006; Carreras, 2007)
Higher-order non-projective inference is nontrivial (McDonald and Pereira, 2006; McDonald and Satta, 2007)
Approximate inference may work well in practice
Reranking of k-best spanning trees (Hall, 2007)