Dependency Parsing by Belief Propagation
![Page 1: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/1.jpg)
1
David A. Smith (JHU → UMass Amherst)
Jason Eisner (Johns Hopkins University)
Dependency Parsing by Belief Propagation
![Page 2: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/2.jpg)
2
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
![Page 3: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/3.jpg)
3
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
![Page 4: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/4.jpg)
4
MOD
Word Dependency Parsing
He reckons the current account deficit will narrow to only 1.8 billion in September.
Raw sentence
Part-of-speech tagging
He reckons the current account deficit will narrow to only 1.8 billion in September.
PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .
POS-tagged sentence
Word dependency parsing
slide adapted from Yuji Matsumoto
Word dependency parsed sentence
He reckons the current account deficit will narrow to only 1.8 billion in September .
SUBJ
ROOT
S-COMP
SUBJ
SPEC
MODMOD
COMPCOMP
![Page 5: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/5.jpg)
5
What does parsing have to do with belief propagation?
loopy belief propagation
belief loopy propagation
![Page 6: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/6.jpg)
6
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
![Page 7: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/7.jpg)
7
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972) In the beginning, we used generative models.
p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
each choice depends on a limited part of the history
but which dependencies to allow? what if they’re all worthwhile?
p(D | A,B,C)?
… p(D | A,B) * p(C | A,B,D)?
![Page 8: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/8.jpg)
8
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972) In the beginning, we used generative models.
Solution: Log-linear (max-entropy) modeling
Features may interact in arbitrary ways Iterative scaling keeps adjusting the feature weights
until the model agrees with the training data.
p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
which dependencies to allow? (given limited training data)
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
throw them all in!
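The log-linear form can be sketched in a few lines of Python. This is a toy illustration only (the outcome set, feature function, and weights below are invented, not the paper's model):

```python
import math

def loglinear_prob(x, outcomes, features, weights):
    """p(x) = exp(score(x)) / Z over a finite outcome set, where
    score(y) is a weighted sum of feature values -- the log-linear
    (max-entropy) form.  Z normalizes over all outcomes."""
    def score(y):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(y).items())
    Z = sum(math.exp(score(y)) for y in outcomes)
    return math.exp(score(x)) / Z
```

With all weights zero the model is uniform; raising one feature's weight smoothly tilts the distribution toward outcomes exhibiting that feature.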
![Page 9: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/9.jpg)
9
Log-linear models: great for n-way classification
Also good for predicting sequences
Also good for dependency parsing
How about structured outputs?
but to allow fast dynamic programming, only use n-gram features
but to allow fast dynamic programming or MST parsing,only use single-edge features
…find preferred links…
find preferred tags
v a n
![Page 10: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/10.jpg)
10
How about structured outputs?
but to allow fast dynamic programming or MST parsing, only use single-edge features
…find preferred links…
![Page 11: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/11.jpg)
11
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
Is this a good edge?
yes, lots of green ...
![Page 12: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/12.jpg)
12
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
Is this a good edge?
jasný den (“bright day”)
![Page 13: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/13.jpg)
13
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
Is this a good edge?
jasný den (“bright day”)
jasný N (“bright NOUN”)
V A A A N J N V C
![Page 14: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/14.jpg)
14
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
Is this a good edge?
jasný den (“bright day”)
jasný N (“bright NOUN”)
V A A A N J N V C
A N
![Page 15: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/15.jpg)
15
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
Is this a good edge?
jasný den (“bright day”)
jasný N (“bright NOUN”)
V A A A N J N V C
A N
A N preceding conjunction
![Page 16: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/16.jpg)
16
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
How about this competing edge?
V A A A N J N V C
not as good, lots of red ...
![Page 17: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/17.jpg)
17
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
How about this competing edge?
V A A A N J N V C
jasný hodiny (“bright clocks”)
... undertrained ...
![Page 18: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/18.jpg)
18
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
How about this competing edge?
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
jasný hodiny (“bright clocks”)
... undertrained ...
jasn hodi (“bright clock,” stems only)
![Page 19: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/19.jpg)
19
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
How about this competing edge?
V A A A N J N V C
jasn hodi (“bright clock,” stems only)
byl jasn stud dubn den a hodi odbí třin
A_plural N_singular
jasný hodiny (“bright clocks”)
... undertrained ...
![Page 20: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/20.jpg)
20
jasný hodiny (“bright clocks”)
... undertrained ...
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
How about this competing edge?
V A A A N J N V C
jasn hodi (“bright clock,” stems only)
byl jasn stud dubn den a hodi odbí třin
A_plural N_singular
A N where N follows a conjunction
![Page 21: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/21.jpg)
21
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
Which edge is better? “bright day” or “bright clocks”?
![Page 22: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/22.jpg)
22
Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
Which edge is better?
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
![Page 23: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/23.jpg)
23
Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better?
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can’t have both (one parent per word)
can’t have both (no crossing links)
Can’t have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.
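In code, an edge-factored score is just a dot product of the current weight vector with the edge's features. The feature templates below (word pair, tag pair, link direction) are illustrative stand-ins, not the exact McDonald et al. (2005) feature set:

```python
def edge_score(head, child, words, tags, weights):
    """Score of the candidate link head -> child as a dot product of
    weights with the edge's binary features.  Indices are positions
    in the sentence; position 0 could hold a ROOT token."""
    feats = [
        ("word-pair", words[head], words[child]),
        ("tag-pair", tags[head], tags[child]),
        ("direction", "right" if head < child else "left"),
    ]
    # Each feature fires with value 1, so the dot product is a sum
    # of the weights of the features present on this edge.
    return sum(weights.get(f, 0.0) for f in feats)
```

A parser would compute this score for every (head, child) pair, then hand the score matrix to a combinatorial algorithm that finds the valid tree with maximum total score.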
![Page 24: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/24.jpg)
24
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
![Page 25: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/25.jpg)
25
Finding Highest-Scoring Parse
The cat in the hat wore a stovepipe. ROOT
Convert to context-free grammar (CFG) Then use dynamic programming
each subtree is a linguistic constituent (here a noun phrase)
The cat
in
the hat
wore
a stovepipe
ROOT
let’s vertically stretch this graph drawing
![Page 26: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/26.jpg)
26
Finding Highest-Scoring Parse
each subtree is a linguistic constituent (here a noun phrase)
The cat
in
the hat
wore
a stovepipe
ROOT
so CKY’s “grammar constant” is no longer constant
Convert to context-free grammar (CFG) Then use dynamic programming
CKY algorithm for CFG parsing is O(n³)
Unfortunately, O(n⁵) in this case
to score the “cat wore” link, it’s not enough to know this is an NP; we must also know it’s rooted at “cat”, so expand the nonterminal set by O(n): {NP_the, NP_cat, NP_hat, ...}
![Page 27: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/27.jpg)
27
Finding Highest-Scoring Parse
each subtree is a linguistic constituent (here a noun phrase)
The cat
in
the hat
wore
a stovepipe
ROOT
Convert to context-free grammar (CFG) Then use dynamic programming
CKY algorithm for CFG parsing is O(n³). Unfortunately, O(n⁵) in this case.
Solution: Use a different decomposition (Eisner 1996). Back to O(n³).
![Page 28: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/28.jpg)
28
Spans vs. constituents
Two kinds of substring:
» Constituent of the tree: links to the rest only through its headword (root).
» Span of the tree: links to the rest only through its endwords.
The cat in the hat wore a stovepipe. ROOT
The cat in the hat wore a stovepipe. ROOT
![Page 29: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/29.jpg)
Decomposing a tree into spans
The cat in the hat wore a stovepipe. ROOT
The cat + cat in the hat wore + wore a stovepipe. ROOT
cat in + in the hat wore
in the hat + hat wore
![Page 30: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/30.jpg)
30
Finding Highest-Scoring Parse
Convert to context-free grammar (CFG) Then use dynamic programming
CKY algorithm for CFG parsing is O(n³). Unfortunately, O(n⁵) in this case.
Solution: Use a different decomposition (Eisner 1996). Back to O(n³).
Can play usual tricks for dynamic programming parsing:
Further refining the constituents or spans: allow prob. model to keep track of even more internal information
A*, best-first, coarse-to-fine; training by EM; etc.
These require “outside” probabilities of constituents, spans, or links
![Page 31: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/31.jpg)
31
Hard Constraints on Valid Trees
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can’t have both (one parent per word)
can’t have both (no crossing links)
Can’t have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.
![Page 32: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/32.jpg)
32
Non-Projective Parses
can’t have both (no crossing links)
The “projectivity” restriction. Do we really want it?
ROOT I ’ll give a talk tomorrow on bootstrapping
subtree rooted at “talk” is a discontiguous noun phrase
![Page 33: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/33.jpg)
33
Non-Projective Parses
ROOT ista meam norit gloria canitiem
ROOT I ’ll give a talk tomorrow on bootstrapping
that.NOM my.ACC may-know glory.NOM going-gray.ACC
That glory may-know my going-gray (i.e., it shall last till I go gray)
occasional non-projectivity in English
frequent non-projectivity in Latin, etc.
![Page 34: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/34.jpg)
34
Finding highest-scoring non-projective tree
Consider the sentence “John saw Mary” (left). The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective. Can be found in time O(n²).
[Figure: an edge-weighted graph over root, John, saw, and Mary (left), and its maximum-weight spanning tree, which keeps the edges of weight 10, 30, and 30 (right)]
slide thanks to Dragomir Radev
Every node selects its best parent. If there are cycles, contract them and repeat.
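The caption's recipe can be sketched directly. This is only the greedy-selection half of Chu-Liu-Edmonds (the contract-and-repeat half is omitted), and the score table in the usage below is a hypothetical example, not the figure's exact numbers:

```python
def greedy_parents(scores, root=0):
    """One pass of Chu-Liu-Edmonds: every non-root node picks its
    highest-scoring parent.  scores[child][parent] = edge weight.
    Returns (parents, cycle): cycle is a list of nodes forming a
    cycle in the chosen graph, or None if the choice is already a
    tree (and hence the maximum spanning tree).  The full algorithm
    would contract the cycle into one node and repeat."""
    parents = {c: max(ps, key=ps.get) for c, ps in scores.items()}
    for start in parents:
        path, cur = [], start
        while cur != root and cur not in path:
            path.append(cur)
            cur = parents[cur]
        if cur != root:            # walked back onto our own path: a cycle
            return parents, path[path.index(cur):]
    return parents, None
```

On a "John saw Mary"-style table where John and saw each prefer the other as parent, the first pass returns the two-node cycle that the full algorithm would then contract before repeating.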
![Page 35: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/35.jpg)
35
Summing over all non-projective trees
Finding highest-scoring non-projective tree: the Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree for “John saw Mary” (which may be non-projective) in time O(n²).
How about total weight Z of all trees? How about outside probabilities or gradients? Can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).
slide thanks to Dragomir Radev
![Page 36: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/36.jpg)
36
Graph Theory to the Rescue!
Tutte’s Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (aka Laplacian) matrix of directed graph G, without row and column r, is equal to the sum of scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need!
O(n³) time!
![Page 37: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/37.jpg)
37
Building the Kirchhoff (Laplacian) Matrix
Start from the matrix of edge scores s(i, j) (score of choosing parent i for child j; node 0 is the root), then build the Kirchhoff matrix (shown after striking the root row and column):

$$
\begin{pmatrix}
0 & s(1,0) & s(2,0) & \cdots & s(n,0)\\
0 & 0 & s(2,1) & \cdots & s(n,1)\\
0 & s(1,2) & 0 & \cdots & s(n,2)\\
\vdots & & & \ddots & \\
0 & s(1,n) & s(2,n) & \cdots & 0
\end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix}
\sum_{j\neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1)\\
-s(1,2) & \sum_{j\neq 2} s(2,j) & \cdots & -s(n,2)\\
\vdots & & \ddots & \\
-s(1,n) & -s(2,n) & \cdots & \sum_{j\neq n} s(n,j)
\end{pmatrix}
$$

• Negate edge scores
• Sum columns (children) onto the diagonal
• Strike the root row/column
• Take the determinant
N.B.: This allows multiple children of root, but see Koo et al. 2007.
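The recipe above is a few lines of NumPy. A sketch, not the authors' code, assuming s[i][j] holds the score of choosing parent i for child j, with node 0 as the root:

```python
import numpy as np

def tree_partition_Z(s, root=0):
    """Total score Z of all directed spanning trees rooted at `root`,
    via the Matrix-Tree theorem.  Following the slide: negate the edge
    scores, put each column's sum (the total incoming score of that
    child) on the diagonal, strike the root row and column, and take
    the determinant of what remains."""
    K = -np.asarray(s, dtype=float)
    np.fill_diagonal(K, 0.0)             # no self-loops
    np.fill_diagonal(K, -K.sum(axis=0))  # column sums on the diagonal
    minor = np.delete(np.delete(K, root, 0), root, 1)
    return float(np.linalg.det(minor))
```

With three nodes and unit scores on the edges root→1, root→2, 1→2, and 2→1, exactly three spanning trees exist, and the determinant evaluates to 3.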
![Page 38: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/38.jpg)
38
Why Should This Work?
Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur.
Expanding the determinant along the entry for edge (1, 2) splits it into a term without that edge and a term with it contracted:

$$\det K \;=\; \det K' \;+\; s(1,2)\,\det K(\{1,2\}\,|\,\{1,2\})$$

where K′ is K without edge (1, 2), and K({1,2} | {1,2}) strikes rows and columns 1 and 2, i.e., contracts the edge.
Clear for the 1×1 matrix; use induction.
Undirected case; special root cases for directed.
![Page 39: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/39.jpg)
39
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
![Page 40: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/40.jpg)
40
Exactly Finding the Best Parse
With arbitrary features, runtime blows up
Projective parsing: O(n³) by dynamic programming
Non-projective: O(n²) by minimum spanning tree
but to allow fast dynamic programming or MST parsing, only use single-edge features
…find preferred links…
grandparents: O(n⁴)
grandp. + sibling bigrams: O(n⁵)
POS trigrams: O(n³g⁶)
… O(2ⁿ)
sibling pairs (non-adjacent): NP-hard
• any of the above features
• soft penalties for crossing links
• pretty much anything else!
![Page 41: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/41.jpg)
41
Let’s reclaim our freedom (again!)
Output probability is a product of local factors Throw in any factors we want! (log-linear model)
How could we find the best parse?
Integer linear programming (Riedel et al., 2006): doesn’t give us probabilities when training or parsing
MCMC: slow to mix? High rejection rate because of the hard TREE constraint?
Greedy hill-climbing (McDonald & Pereira 2006)
This paper in a nutshell
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
none of these exploit the tree structure of parses as the first-order methods do
![Page 42: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/42.jpg)
42
Let’s reclaim our freedom (again!)
Output probability is a product of local factors Throw in any factors we want! (log-linear model)
Let local factors negotiate via “belief propagation”
Links (and tags) reinforce or suppress one another
Each iteration takes total time O(n²) or O(n³)
Converges to a pretty good (but approx.) global parse
certain global factors ok too
each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside)
This paper in a nutshell
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
![Page 43: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/43.jpg)
43
Let’s reclaim our freedom (again!)

| Training with many features | Decoding with many features |
| --- | --- |
| Iterative scaling | Belief propagation |
| Each weight in turn is influenced by others | Each variable in turn is influenced by others |
| Iterate to achieve globally optimal weights | Iterate to achieve locally consistent beliefs |
| To train distrib. over trees, use dynamic programming to compute normalizer Z | To decode distrib. over trees, use dynamic programming to compute messages |
This paper in a nutshell
New!
![Page 44: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/44.jpg)
44
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
![Page 45: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/45.jpg)
45
First, a familiar example Conditional Random Field (CRF) for POS tagging
Local factors in a graphical model
……
find preferred tags
v v v
Possible tagging (i.e., assignment to remaining variables)
Observed input sentence (shaded)
![Page 46: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/46.jpg)
46
Local factors in a graphical model First, a familiar example
Conditional Random Field (CRF) for POS tagging
……
find preferred tags
v a n
Possible tagging (i.e., assignment to remaining variables)
Another possible tagging
Observed input sentence (shaded)
![Page 47: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/47.jpg)
47
Local factors in a graphical model First, a familiar example
Conditional Random Field (CRF) for POS tagging
……
find preferred tags
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1
“Binary” factor that measures compatibility of 2 adjacent tags
Model reuses same parameters at this position
![Page 48: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/48.jpg)
48
Local factors in a graphical model First, a familiar example
Conditional Random Field (CRF) for POS tagging
……
find preferred tags
v 0.2
n 0.2
a 0
“Unary” factor evaluates this tag. Its values depend on the corresponding word.
can’t be adj
![Page 49: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/49.jpg)
49
Local factors in a graphical model First, a familiar example
Conditional Random Field (CRF) for POS tagging
……
find preferred tags
v 0.2
n 0.2
a 0
“Unary” factor evaluates this tag. Its values depend on the corresponding word (could be made to depend on the entire observed sentence).
![Page 50: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/50.jpg)
50
Local factors in a graphical model First, a familiar example
Conditional Random Field (CRF) for POS tagging
……
find preferred tags
v 0.2
n 0.2
a 0
“Unary” factor evaluates this tag. Different unary factor at each position:
v 0.3, n 0.02, a 0
v 0.3, n 0, a 0.1
![Page 51: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/51.jpg)
51
Local factors in a graphical model First, a familiar example
Conditional Random Field (CRF) for POS tagging
……
find preferred tags
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1
v 0.3, n 0.02, a 0
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1
v 0.3, n 0, a 0.1
v 0.2, n 0.2, a 0
v a n
p(v a n) is proportional to the product of all factors’ values on v a n
![Page 52: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/52.jpg)
52
Local factors in a graphical model First, a familiar example
Conditional Random Field (CRF) for POS tagging
……
find preferred tags
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1
v 0.3, n 0.02, a 0
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1
v 0.3, n 0, a 0.1
v 0.2, n 0.2, a 0
v a n
= … 1 * 3 * 0.3 * 0.1 * 0.2 …
p(v a n) is proportional to the product of all factors’ values on v a n
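Plugging the slides' factor tables into code and multiplying (the 1/Z normalizer is omitted, matching "proportional to"; the pairing of unary tables with positions is read off the slide layout, which is slightly ambiguous):

```python
def tagging_score(tags, unary, binary):
    """Unnormalized p(tags): the product of one unary factor value
    per position and one binary factor value per adjacent tag pair."""
    score = 1.0
    for table, tag in zip(unary, tags):
        score *= table[tag]
    for left, right in zip(tags, tags[1:]):
        score *= binary[left][right]
    return score

# Binary factor on adjacent tags (rows = left tag), from the slides.
binary = {"v": {"v": 0, "n": 2, "a": 1},
          "n": {"v": 2, "n": 1, "a": 0},
          "a": {"v": 0, "n": 3, "a": 1}}
# One unary table per position of "find preferred tags".
unary = [{"v": 0.3, "n": 0.02, "a": 0},
         {"v": 0.3, "n": 0, "a": 0.1},
         {"v": 0.2, "n": 0.2, "a": 0}]
```

`tagging_score(("v", "a", "n"), unary, binary)` multiplies 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018, the product shown on the slide.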
![Page 53: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/53.jpg)
53
First, a familiar example CRF for POS tagging
Now let’s do dependency parsing! O(n²) boolean variables for the possible links
v a n
Local factors in a graphical model
find preferred links ……
![Page 54: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/54.jpg)
54
First, a familiar example CRF for POS tagging
Now let’s do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
find preferred links ……
Possible parse, encoded as an assignment to these vars
v a n
![Page 55: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/55.jpg)
55
First, a familiar example CRF for POS tagging
Now let’s do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
find preferred links ……
Possible parse, encoded as an assignment to these vars
Another possible parse
v a n
![Page 56: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/56.jpg)
56
First, a familiar example CRF for POS tagging
Now let’s do dependency parsing! O(n²) boolean variables for the possible links
(cycle)
Local factors in a graphical model
find preferred links ……
Possible parse, encoded as an assignment to these vars
Another possible parse
An illegal parse
v a n
![Page 57: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/57.jpg)
57
First, a familiar example CRF for POS tagging
Now let’s do dependency parsing! O(n²) boolean variables for the possible links
(cycle)
Local factors in a graphical model
find preferred links ……
Possible parse, encoded as an assignment to these vars
Another possible parse
An illegal parse
Another illegal parse (multiple parents)
v a n
![Page 58: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/58.jpg)
58
So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation
Local factors for parsing
find preferred links ……
t 2, f 1
t 1, f 2
t 1, f 2
t 1, f 6
t 1, f 3
as before, goodness of this link can depend on entire observed input context
t 1, f 8
some other links aren’t as good given this input sentence
But what if the best assignment isn’t a tree??
![Page 59: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/59.jpg)
59
Global factors for parsing So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree
this is a “hard constraint”: factor is either 0 or 1
find preferred links ……
ffffff 0
ffffft 0
fffftf 0
…
fftfft 1
…
tttttt 0
![Page 60: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/60.jpg)
60
So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree
this is a “hard constraint”: factor is either 0 or 1
Global factors for parsing
find preferred links ……
ffffff 0
ffffft 0
fffftf 0
…
fftfft 1
…
tttttt 0
64 entries (0/1)
So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).
Note: McDonald et al. (2005) don’t loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!
optionally require the tree to be projective (no crossing links)
we’re legal!
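For a two-word sentence the TREE factor's table shrinks to 4 candidate links and 16 rows, small enough to check by brute force. A sketch (the helper name and representation are invented for illustration): the factor is 1 iff every word has exactly one parent and parent links lead back to the root without cycles.

```python
from itertools import product

def tree_factor(assignment, words=(1, 2), root=0):
    """Hard TREE factor: `assignment` maps each candidate link
    (head, child) to True/False; returns 1 for legal trees, else 0."""
    parent = {}
    for (head, child), on in assignment.items():
        if on:
            if child in parent:        # multiple parents: not a tree
                return 0
            parent[child] = head
    if set(parent) != set(words):      # some word has no parent
        return 0
    for w in words:
        seen, cur = set(), w
        while cur != root:             # follow parents up to the root
            if cur in seen:            # revisited a node: cycle
                return 0
            seen.add(cur)
            cur = parent[cur]
    return 1

# 4 candidate links for 2 words => a 16-row table; count the legal rows.
links = [(0, 1), (0, 2), (1, 2), (2, 1)]
legal = sum(tree_factor(dict(zip(links, bits)))
            for bits in product([False, True], repeat=len(links)))
```

With unit edge scores this count agrees with the Z computed by the Matrix-Tree construction of slides 36-37, which is the point of using combinatorial algorithms instead of the table.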
![Page 61: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/61.jpg)
61
So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree
this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables
grandparent
Local factors for parsing
find preferred links ……
    f  t
f   1  1
t   1  3
![Page 62: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/62.jpg)
62
So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree
this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables
grandparent no-cross
Local factors for parsing
find preferred links ……
by
    f  t
f   1  1
t   1  0.2
![Page 63: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/63.jpg)
63
Local factors for parsing
find preferred links …… by
So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree
this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables
grandparent no-cross siblings hidden POS tags subcategorization …
![Page 64: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/64.jpg)
64
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief
propagation Experiments
Conclusions
New!
Old
![Page 65: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/65.jpg)
65
Good to have lots of features, but … Nice model Shame about the NP-hardness
Can we approximate?
Machine learning to the rescue!
ML community has given a lot to NLP
In the 2000’s, NLP has been giving back to ML
Mainly techniques for joint prediction of structures
Much earlier, speech recognition had HMMs, EM, smoothing …
![Page 66: Dependency Parsing by Belief Propagation](https://reader035.vdocument.in/reader035/viewer/2022062309/56814df7550346895dbb6266/html5/thumbnails/66.jpg)
66
Great Ideas in ML: Message Passing
(Figure: soldiers standing in a line; messages passed down the line read "1 behind you" … "5 behind you" in one direction and "1 before you" … "5 before you" in the other, each soldier adding "there's 1 of me".)
adapted from MacKay (2003) textbook
Count the soldiers
67
Great Ideas in ML: Message Passing
(Figure: one soldier hears "3 behind you" from one side and "2 before you" from the other, and adds "there's 1 of me". Belief: must be 2 + 1 + 3 = 6 of us. He only sees his incoming messages: 2, 1, 3.)
Count the soldiers
adapted from MacKay (2003) textbook
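The line-counting trick is easy to sketch in code (a toy illustration of MacKay's example, not code from the paper; the helper name `count_soldiers` is ours):

```python
def count_soldiers(n):
    """Each soldier passes one message forward ("x behind you") and one
    backward ("y before you"); a soldier's belief combines the two
    incoming messages with "there's 1 of me"."""
    behind = [0] * n                    # behind[i]: message arriving from the left
    for i in range(1, n):
        behind[i] = behind[i - 1] + 1   # neighbor's message plus "1 of me"
    before = [0] * n                    # before[i]: message arriving from the right
    for i in range(n - 2, -1, -1):
        before[i] = before[i + 1] + 1
    # belief at soldier i: messages from both sides, plus the soldier himself
    return [behind[i] + before[i] + 1 for i in range(n)]

print(count_soldiers(6))  # prints [6, 6, 6, 6, 6, 6]: every soldier concludes 6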
68
Great Ideas in ML: Message Passing
(Figure: another soldier hears "4 behind you" and "1 before you", plus "there's 1 of me"; belief: must be 1 + 1 + 4 = 6 of us. The first soldier's belief, 2 + 1 + 3 = 6, agrees: each soldier only sees his incoming messages.)
Count the soldiers
adapted from MacKay (2003) textbook
69
Great Ideas in ML: Message Passing
(Figure: on a tree, a soldier hears "7 here" and "3 here" from two branches, adds "1 of me", and passes "11 here (= 7 + 3 + 1)" up the remaining branch.)
Each soldier receives reports from all branches of tree
adapted from MacKay (2003) textbook
70
Great Ideas in ML: Message Passing
(Figure: the same idea on another edge: "3 here" and "3 here" from two branches combine with "1 of me" into "7 here (= 3 + 3 + 1)".)
Each soldier receives reports from all branches of tree
adapted from MacKay (2003) textbook
71
Great Ideas in ML: Message Passing
(Figure: messages flow along every edge in both directions: "7 here" and "3 here" again combine into "11 here (= 7 + 3 + 1)".)
Each soldier receives reports from all branches of tree
adapted from MacKay (2003) textbook
72
Great Ideas in ML: Message Passing
(Figure: a soldier receives "7 here", "3 here", and "3 here" from his three branches. Belief: must be 14 of us.)
Each soldier receives reports from all branches of tree
adapted from MacKay (2003) textbook
73
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
(Figure: same picture, belief "must be 14 of us".)
This wouldn't work correctly with a "loopy" (cyclic) graph.
adapted from MacKay (2003) textbook
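The tree version of the counting scheme can be sketched the same way (again a toy illustration, not code from the paper; `tree_beliefs` is a hypothetical name):

```python
from collections import defaultdict

def tree_beliefs(edges):
    """Soldier counting on a tree: the message a node sends across an edge is
    the size of its own side of that edge (itself plus what it heard from its
    other neighbors).  Every node's belief then equals the total node count.
    On a loopy (cyclic) graph this recursion would never terminate --
    evidence would be double-counted, as the slide warns."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    msg = {}

    def send(u, v):
        # message u -> v: "this many of us on my side of the edge"
        if (u, v) not in msg:
            msg[(u, v)] = 1 + sum(send(w, u) for w in adj[u] if w != v)
        return msg[(u, v)]

    # belief at each node: reports from all branches, plus itself
    return {u: 1 + sum(send(w, u) for w in adj[u]) for u in adj}
```

For any tree, every node's belief comes out equal to the number of nodes, using only local messages.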
7474
……
find preferred tags
Great ideas in ML: Forward-Backward
In the CRF, message passing = forward-backward
(Figure: at one tag position, the incoming α message is (v 2, n 1, a 7) and the β message is (v 3, n 1, a 6); multiplied with the unary factor (v 0.3, n 0, a 0.1), the belief is (v 1.8, n 0, a 4.2). Other messages shown: (v 7, n 2, a 1) and (v 3, n 6, a 1). The pairwise factor table over tags v, n, a is: v→(0, 2, 1), n→(2, 1, 0), a→(0, 3, 1).)
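Forward-backward as message passing on a chain can be sketched directly. The pairwise table below is the one from the slide; the unary scores in the example are made up for illustration, and `chain_beliefs` is our name, not the paper's:

```python
import numpy as np

# pairwise factor over adjacent tags, order (v, n, a) -- the slide's table
PSI = np.array([[0., 2., 1.],
                [2., 1., 0.],
                [0., 3., 1.]])

def chain_beliefs(unary):
    """Forward-backward = BP on a chain: alpha messages flow left-to-right,
    beta messages right-to-left; the (unnormalized) belief at each position
    is unary * alpha * beta, just as pictured on the slide."""
    T = len(unary)
    alpha = [np.ones(3)]
    for t in range(1, T):
        alpha.append(PSI.T @ (alpha[t - 1] * unary[t - 1]))
    beta = [np.ones(3) for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = PSI @ (beta[t + 1] * unary[t + 1])
    return [unary[t] * alpha[t] * beta[t] for t in range(T)]

# unnormalized tag marginals for a 3-word chain with one informative unary
beliefs = chain_beliefs([np.ones(3), np.array([0.3, 0.0, 0.1]), np.ones(3)])
```

On a chain these beliefs match the exact marginals obtained by summing over all tag sequences.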
75
Extend CRF to "skip chain" to capture non-local factor
More influences on belief
……
find preferred tags
Great ideas in ML: Forward-Backward
(Figure: the skip-chain factor sends an extra message (v 3, n 1, a 6) into the position; with α (v 2, n 1, a 7), β (v 3, n 1, a 6), and unary (v 0.3, n 0, a 0.1), the belief grows to (v 5.4, n 0, a 25.2).)
76
Extend CRF to "skip chain" to capture non-local factor
More influences on belief, but the graph becomes loopy
……
find preferred tags
Great ideas in ML: Forward-Backward
(Figure: same messages as before; the belief (v 5.4, n 0, a 25.2) now multiplies in the skip-chain message.)
Red messages not independent? Pretend they are!
77
Two great tastes that taste great together
"You got dynamic programming in my belief propagation!"
"You got belief propagation in my dynamic programming!"
Upcoming attractions …
78
Loopy Belief Propagation for Parsing
find preferred links ……
Sentence tells word 3, "Please be a verb."
Word 3 tells the 3→7 link, "Sorry, then you probably don't exist."
The 3→7 link tells the TREE factor, "You'll have to find another parent for 7."
The TREE factor tells the 10→7 link, "You're on!"
The 10→7 link tells word 10, "Could you please be a noun?" …
79
Higher-order factors (e.g., Grandparent) induce loops
Let's watch a loop around one triangle …
Strong links are suppressing or promoting other links …
Loopy Belief Propagation for Parsing
find preferred links ……
80
Higher-order factors (e.g., Grandparent) induce loops
Let's watch a loop around one triangle …
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
Loopy Belief Propagation for Parsing
find preferred links ……
(TREE factor table over all six link variables: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0)
81
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
Loopy Belief Propagation for Parsing
find preferred links ……
(TREE factor table over all six link variables: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0)
But this is the outside probability of the green link!
TREE factor computes all outgoing messages at once (given all incoming messages)
Projective case: total O(n³) time by inside-outside
Non-projective case: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
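The non-projective computation can be sketched via the Matrix-Tree theorem. This is a minimal illustration assuming numpy, with our own function name `edge_marginals`; the paper's TREE factor computes the analogous quantities from its incoming messages:

```python
import numpy as np

def edge_marginals(w):
    """Edge marginals of p(tree) ∝ product of edge weights, over all
    dependency trees (spanning arborescences rooted at node 0), computed
    with the Matrix-Tree (Kirchhoff) theorem by one matrix inversion.
    w[h, m] is the weight of head h governing modifier m; w's diagonal
    and column 0 are ignored (nothing may govern the root)."""
    n = w.shape[0] - 1
    L = -w.copy()                          # Laplacian: off-diagonal = -weight
    np.fill_diagonal(L, 0.0)
    d = -L.sum(axis=0)                     # diagonal = column sums of weights
    L[np.arange(n + 1), np.arange(n + 1)] = d
    Lhat = L[1:, 1:]                       # minor: delete the root row/column
    Linv = np.linalg.inv(Lhat)             # Z = det(Lhat); marginals from Lhat^-1
    marg = np.zeros_like(w)
    for m in range(1, n + 1):
        for h in range(n + 1):
            if h == m:
                continue
            # d log Z / d w[h, m], times w[h, m]
            val = Linv[m - 1, m - 1] - (Linv[m - 1, h - 1] if h > 0 else 0.0)
            marg[h, m] = w[h, m] * val
    return marg
```

For example, with uniform weights over two words there are three arborescences, and the marginal of each "root governs word" edge comes out 2/3. Each modifier's marginals sum to 1, since every word has exactly one parent in every tree.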
82
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
Loopy Belief Propagation for Parsing
But this is the outside probability of the green link!
TREE factor computes all outgoing messages at once (given all incoming messages)
Projective case: total O(n³) time by inside-outside
Non-projective case: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
Belief propagation assumes incoming messages to TREE are independent. So outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
83
Some connections …
Parser stacking (Nivre & McDonald 2008, Martins et al. 2008)
Global constraints in arc consistency: the ALLDIFFERENT constraint (Régin 1994)
Matching constraint in max-product BP: used for computer vision (Duchi et al., 2006); could be used for machine translation
As far as we know, our parser is the first use of global constraints in sum-product BP.
84
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
85
Runtimes for each factor type, per iteration (see paper)

| Factor type | degree | runtime | count | total |
|---|---|---|---|---|
| Tree | O(n²) | O(n³) | 1 | O(n³) |
| Proj. Tree | O(n²) | O(n³) | 1 | O(n³) |
| Individual links | 1 | O(1) | O(n²) | O(n²) |
| Grandparent | 2 | O(1) | O(n³) | O(n³) |
| Sibling pairs | 2 | O(1) | O(n³) | O(n³) |
| Sibling bigrams | O(n) | O(n²) | O(n) | O(n³) |
| NoCross | O(n) | O(n) | O(n²) | O(n³) |
| Tag | 1 | O(g) | O(n) | O(n) |
| TagLink | 3 | O(g²) | O(n²) | O(n²) |
| TagTrigram | O(n) | O(ng³) | 1 | O(n) |
| TOTAL | | | | O(n³) |

+ = additive, not multiplicative!
86
Runtimes for each factor type, per iteration (see paper): table as on the previous slide.
Each "global" factor coordinates an unbounded # of variables.
Standard belief propagation would take exponential time to iterate over all configurations of those variables.
See paper for efficient propagators
87
Experimental Details

Decoding:
- Run several iterations of belief propagation
- Get final beliefs at link variables
- Feed them into a first-order parser
- This gives the Min Bayes Risk tree (minimizes expected error)

Training:
- BP computes beliefs about each factor, too …
- … which gives us gradients for max conditional likelihood (as in the forward-backward algorithm)

Features used in experiments:
- First-order: individual links, just as in McDonald et al. 2005
- Higher-order: Grandparent, Sibling bigrams, NoCross
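The decoding step can be sketched for tiny sentences: score each candidate tree by the summed beliefs of its links and keep the best, which is the minimum-Bayes-risk tree under per-edge loss. This brute-force sketch is ours (the paper instead feeds the beliefs to a fast first-order parser):

```python
from itertools import product

def is_tree(heads):
    """heads[m-1] = head of word m (0 = root); true iff every word reaches 0."""
    for m in range(1, len(heads) + 1):
        seen, h = set(), m
        while h != 0:
            if h in seen:
                return False           # cycle: never reaches the root
            seen.add(h)
            h = heads[h - 1]
    return True

def mbr_tree(belief):
    """Pick the tree whose links have the highest total belief (the expected
    number of correct edges).  Brute force over head vectors for tiny n."""
    n = len(belief) - 1
    best_score, best = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        score = sum(belief[h][m] for m, h in zip(range(1, n + 1), heads))
        if score > best_score:
            best_score, best = score, heads
    return best

# hypothetical link beliefs for a 2-word sentence: belief[h][m], h = 0 is root
beliefs = [[0, 0.9, 0.3],
           [0, 0.0, 0.7],
           [0, 0.1, 0.0]]
print(mbr_tree(beliefs))  # prints (0, 1): root -> word 1, word 1 -> word 2
```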
88
Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

| | Danish | Dutch | English |
|---|---|---|---|
| Tree+Link | 85.5 | 87.3 | 88.6 |
| +NoCross | 86.1 | 88.3 | 89.1 |
| +Grandparent | 86.1 | 88.6 | 89.4 |
| +ChildSeq | 86.5 | 88.5 | 90.1 |
89
Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

| | Danish | Dutch | English |
|---|---|---|---|
| Tree+Link | 85.5 | 87.3 | 88.6 |
| +NoCross | 86.1 | 88.3 | 89.1 |
| +Grandparent | 86.1 | 88.6 | 89.4 |
| +ChildSeq | 86.5 | 88.5 | 90.1 |
| Best projective parse with all factors | 86.0 | 84.5 | 90.2 |
| +hill-climbing | 86.1 | 87.6 | 90.2 |

(Annotations: "exact, slow" on the all-factors projective parse; hill-climbing "doesn't fix enough edges".)
90
Time vs. Projective Search Error
(Figure: projective search error vs. number of BP iterations, compared with O(n⁴) and O(n⁵) dynamic-programming baselines.)
92
Outline
Edge-factored parsing Dependency parses Scoring the competing parses: Edge features Finding the best parse
Higher-order parsing Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments
Conclusions
New!
Old
93
Freedom Regained
This paper in a nutshell:
- Output probability is defined as a product of local and global factors: throw in any factors we want! (log-linear model). Each factor must be fast, but they run independently.
- Let local factors negotiate via "belief propagation": each bit of syntactic structure is influenced by others. Some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming. Each iteration takes total time O(n³) or even O(n²); see paper. Compare reranking or stacking.
- Converges to a pretty good (but approximate) global parse: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.
94
Future Opportunities
- Efficiently modeling more hidden structure: POS tags, link roles, secondary links (DAG-shaped parses)
- Beyond dependencies: constituency parsing, traces, lattice parsing
- Beyond parsing: alignment, translation; bipartite matching and network flow; joint decoding of parsing and other tasks (IE, MT, reasoning …)
- Beyond text: image tracking and retrieval; social networks
95
Thank you!