Learning PCFGs: Estimating Parameters, Learning Grammar Rules
Many slides are taken or adapted from slides by Dan Klein
Treebanks
An example tree from the Penn Treebank
The Penn Treebank
• 1 million tokens
• In 50,000 sentences, each labeled with
  – A POS tag for each token
  – Labeled constituents
  – “Extra” information
    • Phrase annotations like “TMP”
    • “Empty” constituents for wh-movement traces and empty subjects for raising constructions
Supervised PCFG Learning
1. Preprocess the treebank
   1. Remove all “extra” information (empties, extra annotations)
   2. Convert to Chomsky Normal Form
   3. Possibly prune some punctuation, lower-case all words, compute word “shapes”, and do other processing to combat sparsity
2. Count the occurrences of each nonterminal, c(N), and of each observed production rule, c(N -> NL NR) and c(N -> w)
3. Set the probability for each rule to the MLE (sketched in code below):
   P(N -> NL NR) = c(N -> NL NR) / c(N)
   P(N -> w) = c(N -> w) / c(N)
Easy, peasy, lemon-squeezy.
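A minimal sketch of steps 2 and 3, assuming trees are already in Chomsky Normal Form; the Tree class and the toy treebank are hypothetical stand-ins for real Penn Treebank data:

```python
from collections import defaultdict

class Tree:
    def __init__(self, label, children):
        self.label = label        # nonterminal symbol, e.g. "NP"
        self.children = children  # [Tree, Tree] for a binary rule, or [word]

def count_rules(trees):
    """Count c(N) and c(N -> rhs) over a treebank in Chomsky Normal Form."""
    nt_counts = defaultdict(int)
    rule_counts = defaultdict(int)
    def visit(node):
        nt_counts[node.label] += 1
        if isinstance(node.children[0], str):          # lexical rule N -> w
            rule_counts[(node.label, node.children[0])] += 1
        else:                                          # binary rule N -> NL NR
            left, right = node.children
            rule_counts[(node.label, (left.label, right.label))] += 1
            visit(left)
            visit(right)
    for tree in trees:
        visit(tree)
    return nt_counts, rule_counts

def mle_pcfg(trees):
    """P(N -> rhs) = c(N -> rhs) / c(N)."""
    nt_counts, rule_counts = count_rules(trees)
    return {(lhs, rhs): c / nt_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

# Toy usage: the single tree (S (NP dogs) (VP bark))
toy = Tree("S", [Tree("NP", ["dogs"]), Tree("VP", ["bark"])])
print(mle_pcfg([toy]))  # every rule occurs once under its parent, so all 1.0
```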
Complications
• Smoothing
  – Especially for lexicalized grammars, many test productions will never be observed during training
  – We don’t necessarily want to assign these productions zero probability
  – Instead, define backoff distributions (sketched in code below), e.g.:
    Pfinal(VP[transmogrified] -> V[transmogrified] PP[into])
        = α P(VP[transmogrified] -> V[transmogrified] PP[into])
        + (1 − α) P(VP -> V PP[into])
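As a toy illustration of the interpolation, here is a minimal sketch; the probability tables and the mixing weight α are hypothetical, and in practice α would be tuned on held-out data:

```python
def backoff_prob(lex_rule, backoff_rule, p_lex, p_backoff, alpha=0.7):
    """P_final = alpha * P(lexicalized rule) + (1 - alpha) * P(backoff rule)."""
    return (alpha * p_lex.get(lex_rule, 0.0)
            + (1 - alpha) * p_backoff.get(backoff_rule, 0.0))

# Toy usage: the lexicalized rule was never seen, so only backoff contributes.
p_lex = {("VP[transmogrified]", ("V[transmogrified]", "PP[into]")): 0.0}
p_backoff = {("VP", ("V", "PP[into]")): 0.2}
print(backoff_prob(("VP[transmogrified]", ("V[transmogrified]", "PP[into]")),
                   ("VP", ("V", "PP[into]")), p_lex, p_backoff))  # 0.06
```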
Problems with Supervised PCFG Learning
• Coming up with labeled data is hard!
  – Time-consuming
  – Expensive
  – Hard to adapt to new domains, tasks, languages
  – Corpus availability drives research (instead of tasks driving the research)
• The Penn Treebank took many person-years to annotate manually.
Unsupervised Learning of PCFGs: Feasible?
Unsupervised Learning
• Systems take raw data and automatically detect structure
• Why?
  – More data is available
  – Kids learn (some aspects of) language with no supervision
  – Insights into machine learning and clustering
Grammar Induction and Learnability
• Some have argued that learning syntax from positive data alone is impossible
  – Gold, 1967: non-identifiability in the limit
  – Chomsky, 1980: poverty of the stimulus
• Surprising result: it’s possible to get entirely unsupervised parsing to work (reasonably) well.
Learnability
• Learnability: formal conditions under which a class of languages can be learned
• Setup:
  – A class of languages Λ
  – An algorithm H (the learner)
  – H sees a sequence X of strings x1 … xn
  – H maps sequences X to languages L in Λ
• The question: for what classes Λ do learners H exist?
Learnability [Gold, 1967]
• Criterion: identification in the limit
  – A presentation of L is an infinite sequence of x’s from L in which each x occurs at least once
  – A learner H identifies L in the limit if, for any presentation of L, from some point n onwards, H always outputs L
  – A class Λ is identifiable in the limit if there is some single H which correctly identifies in the limit every L in Λ
• Example: Λ = {{a}, {a,b}} is identifiable in the limit.
• Theorem (Gold, 1967): any Λ which contains all finite languages and at least one infinite language (i.e., is superfinite) is unlearnable in this sense.
Learnability [Gold, 1967]
• Proof sketch
  – Assume Λ is superfinite and that H identifies Λ in the limit
  – There exists a chain L1 ⊂ L2 ⊂ … ⊂ L∞
  – Construct the following misleading sequence:
    • Present strings from L1 until H outputs L1
    • Present strings from L2 until H outputs L2
    • …
  – This is a presentation of L∞, but H never outputs L∞
Learnability [Horning, 1969]
• Problem: identification in the limit requires that H succeed on all presentations, even the weird ones
• Another criterion: measure-one identification
  – Assume a distribution PL(x) for each L
  – Assume PL(x) puts non-zero probability on all and only the x in L
  – Assume an infinite presentation of x drawn i.i.d. from PL(x)
  – H measure-one identifies L if the probability of drawing a sequence X from which H can identify L is 1
• Theorem (Horning, 1969): PCFGs can be identified in this sense
  – Note: there can be misleading sequences, but they have to be (infinitely) unlikely
Learnability [Horning, 1969]
• Proof sketch
  – Assume Λ is a recursively enumerable set of recursive languages (e.g., the set of all PCFGs)
  – Assume an ordering on all strings: x1 < x2 < …
  – Define: two sequences A and B agree through n iff for all x < xn, x is in A ⟺ x is in B
  – Define the error set E(L,n,m):
    • All sequences such that the first m elements do not agree with L through n
    • These are the sequences which contain early strings outside of L (can’t happen), or which fail to contain all of the early strings in L (happens less as m increases)
  – Claim: P(E(L,n,m)) goes to 0 as m goes to ∞
  – Let dL(n) be the smallest m such that P(E(L,n,m)) < 2^(−n)
  – Let d(n) be the largest dL(n) among the first n languages
  – Learner: after d(n) examples, pick the first L that agrees with the evidence through n
  – This can only fail for a sequence X if X keeps showing up in E(L, n, d(n)), which happens infinitely often with probability zero
Learnability
• Gold’s results say little about real learners (the requirements are too strong)
• Horning’s algorithm is completely impractical
  – It needs astronomical amounts of data
• Even measure-one identification doesn’t say anything about tree structures
  – It only talks about learning grammatical sets
  – Strong generative vs. weak generative capacity
Unsupervised POS Tagging
• Some (discouraging) experiments [Merialdo 94]
• Setup:
  – You know the set of allowable tags for each word (but not the frequency of each tag)
  – Learn a supervised model on k training sentences
    • Learn P(w|t) and P(ti | ti−1, ti−2) on these sentences
  – On the remaining n > k sentences, re-estimate with EM (a bigram sketch follows below)
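To make the setup concrete, here is a minimal E-step sketch for the EM re-estimation, using a bigram tag model P(ti | ti−1) rather than the trigram model above to keep it short; all names and shapes are assumptions, and the tag-dictionary constraint can be enforced by zeroing disallowed entries of the emission matrix B:

```python
import numpy as np

def forward_backward(obs, pi0, A, B):
    """Posterior tag and transition expectations for one sentence.
    obs: word ids; pi0: initial tag probs (K,);
    A: tag transitions (K,K); B: emissions (K,V)."""
    K, T = len(pi0), len(obs)
    alpha = np.zeros((T, K))                 # forward probabilities
    beta = np.zeros((T, K))                  # backward probabilities
    alpha[0] = pi0 * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    Z = alpha[T - 1].sum()                   # P(sentence | model)
    gamma = alpha * beta / Z                 # P(tag at position t = k | sentence)
    xi = np.zeros((K, K))                    # expected transition counts
    for t in range(T - 1):
        xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / Z
    return gamma, xi                         # the M-step renormalizes these
```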
Merialdo: Results
Grammar Induction
Unsupervised Learning of Grammars and Parameters
Right-branching Baseline
• In English (but not necessarily in other languages), trees tend to be right-branching:
• A simple, English-specific baseline is to choose the right-branching chain structure for each sentence (see the sketch below).
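A minimal sketch of the baseline: bracket every sentence as a right chain, (w1 (w2 (w3 … wn))). Spans are returned 0-indexed and end-exclusive so they can be scored against gold spans:

```python
def right_branching_spans(n):
    """Constituent spans of the right-chain tree over a sentence of n words."""
    return [(i, n) for i in range(n - 1)]

print(right_branching_spans(4))  # [(0, 4), (1, 4), (2, 4)]
```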
Distributional Clustering
Nearest Neighbors
Learn PCFGs with EM [Lari and Young, 1990]
• Setup:
  – Full binary grammar with n nonterminals {X1, …, Xn} (that is, at the beginning, the grammar has all possible rules)
  – Parse uniformly/randomly at first
  – Re-estimate rule expectations off of the parses
  – Repeat
• Their conclusion: it doesn’t really work
EM for PCFGs: Details
1. Start with a “full” grammar, with all possible binary rules for our nonterminals N1 … Nk. Designate one of them as the start symbol, say N1
2. Assign some starting distribution to the rules, such as
   1. Random
   2. Uniform
   3. Some “smart” initialization techniques (see assigned reading)
3. E-step: take an unannotated sentence S and compute, for all nonterminals N, NL, NR and all terminals w:
   E(N | S), E(N -> NL NR, N is used | S), E(N -> w, N is used | S)
4. M-step: reset rule probabilities to the MLE:
   P(N -> NL NR) = E(N -> NL NR | S) / E(N | S)
   P(N -> w) = E(N -> w | S) / E(N | S)
5. Repeat steps 3 and 4 until the rule probabilities stabilize (“converge”); a driver sketch follows below
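To make steps 3 and 4 concrete, here is a minimal single-sentence driver, chaining the hypothetical helpers inside(), outside(), and expected_counts() sketched after the next few slides; a real learner would accumulate expectations over the whole corpus before renormalizing:

```python
def em_iteration(words, nonterms, binary, lexical):
    _, beta = inside(words, nonterms, binary, lexical)    # E-step: inside chart
    alpha = outside(len(words), nonterms, binary, beta)   # E-step: outside chart
    new_binary, new_lexical = {}, {}
    for j in nonterms:
        e_used, e_rule, e_word = expected_counts(words, j, alpha, beta, binary)
        for rule, e in e_rule.items():                    # M-step: MLE updates
            new_binary[rule] = e / e_used if e_used > 0 else 0.0
        for w, e in e_word.items():
            new_lexical[(j, w)] = e / e_used if e_used > 0 else 0.0
    return new_binary, new_lexical
```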
E-Step

• Let $\pi = P(N^1 \Rightarrow^* w_{1m} \mid G)$

• We can define the expectations we want in terms of the $\pi$, $\alpha$, $\beta$ quantities:

$$E(N^j \text{ is used in the derivation} \mid N^1 \Rightarrow^* w_{1m}) = \frac{1}{\pi} \sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p,q)\,\beta_j(p,q)$$

$$E(N^j \to N^l N^r,\ N^j \text{ is used} \mid N^1 \Rightarrow^* w_{1m}) = \frac{1}{\pi} \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1} \alpha_j(p,q)\, P(N^j \to N^l N^r)\, \beta_l(p,d)\, \beta_r(d+1,q)$$

$$E(N^j \to w,\ N^j \text{ is used} \mid N^1 \Rightarrow^* w_{1m}) = \frac{1}{\pi} \sum_{p:\, w_p = w} \alpha_j(p,p)\, \beta_j(p,p)$$
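Given the α and β charts (their recursions follow on the next two slides), these expectations reduce to sums over chart entries. A minimal sketch, assuming hypothetical chart dictionaries keyed (nonterminal, p, q) and a rule table binary keyed (parent, left, right):

```python
def expected_counts(words, j, alpha, beta, binary):
    m = len(words)
    pi = beta[(1, 1, m)]                       # P(N^1 =>* w_1m | G)
    # E(N^j is used in the derivation | S)
    e_used = sum(alpha[(j, p, q)] * beta[(j, p, q)]
                 for p in range(1, m + 1) for q in range(p, m + 1)) / pi
    # E(N^j -> N^l N^r, N^j is used | S), for each binary rule headed by N^j
    e_rule = {}
    for (par, l, r), prob in binary.items():
        if par != j:
            continue
        e_rule[(par, l, r)] = sum(
            alpha[(j, p, q)] * prob * beta[(l, p, d)] * beta[(r, d + 1, q)]
            for p in range(1, m)
            for q in range(p + 1, m + 1)
            for d in range(p, q)) / pi
    # E(N^j -> w, N^j is used | S): width-one spans where w occurs
    e_word = {}
    for p, w in enumerate(words, start=1):
        e_word[w] = e_word.get(w, 0.0) + alpha[(j, p, p)] * beta[(j, p, p)] / pi
    return e_used, e_rule, e_word
```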
Inside Probabilities

Base case:

$$\beta_j(k,k) = P(w_k \mid N^j_{kk}, G) = P(N^j \to w_k \mid G)$$

Induction:

$$\beta_j(p,q) = P(w_{pq} \mid N^j_{pq}, G) = \sum_{l,r} \sum_{d=p}^{q-1} P(N^j \to N^l N^r)\, \beta_l(p,d)\, \beta_r(d+1,q)$$

(Figure: $N^j$ spans $w_p \ldots w_q$, split into $N^l$ over $w_p \ldots w_d$ and $N^r$ over $w_{d+1} \ldots w_q$.)
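The induction bottoms out at width-one spans and builds larger spans from smaller ones. A minimal CKY-style sketch, assuming hypothetical rule dictionaries binary (keyed (parent, left, right)) and lexical (keyed (nonterminal, word)), with integer nonterminals and 1 as the start symbol:

```python
from collections import defaultdict

def inside(words, nonterms, binary, lexical):
    m = len(words)
    beta = defaultdict(float)                 # beta[(j, p, q)], 1-indexed spans
    for p, w in enumerate(words, start=1):    # base case: width-one spans
        for j in nonterms:
            beta[(j, p, p)] = lexical.get((j, w), 0.0)
    for width in range(2, m + 1):             # induction: small to large spans
        for p in range(1, m - width + 2):
            q = p + width - 1
            for (j, l, r), prob in binary.items():
                total = 0.0
                for d in range(p, q):         # split point
                    total += prob * beta[(l, p, d)] * beta[(r, d + 1, q)]
                beta[(j, p, q)] += total
    return beta[(1, 1, m)], beta              # pi = P(N^1 =>* w_1m), full chart
```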
Outside Probabilities

Base case:

$$\alpha_1(1,m) = 1, \qquad \alpha_j(1,m) = 0 \ \text{for } j \neq 1$$

Induction:

$$\alpha_j(p,q) = \sum_{l,r} \sum_{e=q+1}^{m} \alpha_l(p,e)\, P(N^l \to N^j N^r)\, \beta_r(q+1,e) + \sum_{l,r} \sum_{e=1}^{p-1} \alpha_l(e,q)\, P(N^l \to N^r N^j)\, \beta_r(e,p-1)$$

(Figure: $N^j$ spans $w_p \ldots w_q$; its sibling under the parent $N^l$ covers the adjacent material to the right or to the left.)
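A minimal sketch of this recursion, reusing the beta chart from the inside() sketch above and the same hypothetical binary dictionary; spans are processed from largest to smallest, since α for a span depends on α of the spans containing it:

```python
from collections import defaultdict

def outside(m, nonterms, binary, beta):
    alpha = defaultdict(float)
    alpha[(1, 1, m)] = 1.0                    # base case: the start symbol
    for width in range(m - 1, 0, -1):         # larger spans feed smaller ones
        for p in range(1, m - width + 2):
            q = p + width - 1
            for j in nonterms:
                total = 0.0
                for (par, lc, rc), prob in binary.items():
                    if lc == j:               # N^j is a left child: par -> j rc
                        for e in range(q + 1, m + 1):
                            total += alpha[(par, p, e)] * prob * beta[(rc, q + 1, e)]
                    if rc == j:               # N^j is a right child: par -> lc j
                        for e in range(1, p):
                            total += alpha[(par, e, q)] * prob * beta[(lc, e, p - 1)]
                alpha[(j, p, q)] = total
    return alpha
```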
Problem: Model Symmetries
Distributional Syntax?
Problem: Identifying Constituents
A nested distributional model
• We’d like a model that
  – Ties spans to linear contexts (like distributional clustering)
  – Considers only proper tree structures (like PCFGs)
  – Has no symmetries to break (like a dependency model)
Constituent Context Model (CCM)
Results: Constituency
Results: Dependencies
Results: Combined Models
Multilingual Results