Using Minimum Description Length to make Grammatical Generalizations
Mike Dowman
University of Tokyo
What should Syntactic Theory Explain?
• Which sentences are grammatical and which are not
or
• How to transform observed sentences into a grammar
(Diagram: E → I, labelled ‘learning’)
Children transform observed sentences (E)
into psychological knowledge of language (I)
How should we study syntax?
Linguists’ Approach:
• Choose some sentences
• Decide on the grammaticality of each one
• Make a grammar that accounts for which of these sentences are grammatical and which are not
(Diagram: informant → sentences → linguist → grammar)
Computational Linguists’ Approach (Unsupervised Learning)
• Take a corpus
• Extract as much information from the corpus as accurately as possible
or
• Learn a grammar that describes the corpus as accurately as possible
(Diagram: corpus → grammar, lexical items, language model, etc.)
Which approach gives more insight into language?
Linguists tend to aim for high precision
• But only produce very limited and arbitrary coverage
Computational linguists tend to obtain much better coverage
• But don’t account for any body of data completely
• And tend to learn only simpler kinds of structure
The approaches seem to be largely complementary
Which approach gives more insight into the human mind?
The huge size and complexity of languages is one of their key distinctive properties
The linguists’ approach doesn’t account for this
So should we apply our algorithms to large corpora of naturally occurring data?
This won’t directly address the kind of issue that syntacticians focus on
Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John hurt himself        Mary hurt John
John hated himself       Mary hated John
John behaved himself   * Mary behaved John
Implicit Negative Evidence
If we never hear something, can’t we just assume it’s not grammatical?
Sentences we never heard?
Phrases we never heard?
Verb argument constructions we never heard?
Word-affix combinations we never heard?
How often does something have to not occur before we decide it’s not grammatical?
At what structural level do we make generalizations?
Minimum Description Length (MDL)
MDL may be able to solve the ‘no negative evidence’ problem
Prefers the grammar that results in the simplest overall description of data
• So prefers simple grammars
• And grammars that allow simple descriptions of the data
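In the standard two-part formulation (standard MDL notation, not taken from these slides), the preferred grammar G* minimizes the summed code length of the grammar itself and of the data encoded with its help:

```latex
G^{*} = \arg\min_{G} \big[\, L(G) + L(D \mid G) \,\big]
```

A grammar that overgenerates keeps L(G) small but wastes bits describing the data; a grammar that merely lists the observed sentences fits the data tightly but makes L(G) huge.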
(Diagram, built up over successive slides: within the space of possible sentences, the observed sentences occupy a small region. Three grammars are compared: a simple but non-constraining grammar, which covers far more of the space than was observed; a complex but constraining grammar, which covers little more than the observed sentences themselves; and a grammar that is a good fit to the data, covering the observed sentences plus a limited generalization beyond them.)
Why it has to be MDL
Many machine learning techniques have been applied in computational linguistics
MDL is very rarely used
And where it has been applied, it has not been especially successful at learning grammatical structure from corpora
So why MDL?
Maximum Likelihood
Maximum likelihood can be seen as a special case of MDL in which the a priori probability of all hypotheses P(h) is equal
But the hypothesis that only the observed sentences are grammatical will result in the maximum likelihood
So ML can only be applied if there are restrictions on how well the estimated parameters can fit the data
The degree of generality of the grammars is set externally, not determined by the Maximum Likelihood principle
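To make this explicit (using only the standard coding identity L = -log2 P, which the slides leave implicit): minimizing total code length is the same as maximizing P(h)P(D|h), and with P(h) uniform this reduces to maximum likelihood:

```latex
\arg\min_{h} \big[ -\log_2 P(h) - \log_2 P(D \mid h) \big]
  = \arg\max_{h} P(h)\, P(D \mid h)
  \;\overset{P(h)\ \text{uniform}}{=}\; \arg\max_{h} P(D \mid h)
```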
Maximum Entropy
Make the grammar as unrestrictive as possible
But constraints must be used to prevent the grammar from simply allowing any combination of words as a grammatical sentence
Again the degree of generality of grammars is determined externally
Neither Maximum Likelihood nor Maximum Entropy provide a principle that can decide when to make generalizations
1  S → NP VP
2  NP → John
3  NP → Mary
4  VP → screamed
5  VP → died
Describing data in terms of the grammar: 1, 2, 4 = John screamed
There is a restricted range of choices at each stage of the derivation
Fewer choices = higher probability
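As a concrete illustration (a sketch of mine, not the author's code): if each derivation step chooses uniformly among the rules available for the nonterminal being expanded, a step costs log2(number of choices) bits, so the derivation 1, 2, 4 for John screamed costs 2 bits:

```python
import math

# Toy grammar: nonterminal -> list of alternative right-hand sides
grammar = {
    "S":  [["NP", "VP"]],             # rule 1
    "NP": [["John"], ["Mary"]],       # rules 2, 3
    "VP": [["screamed"], ["died"]],   # rules 4, 5
}

def derivation_bits(nonterminals_expanded):
    """Bits to encode one uniform choice per nonterminal expanded."""
    return sum(math.log2(len(grammar[nt])) for nt in nonterminals_expanded)

# Rules 1, 2, 4 = "John screamed": S (1 choice), NP (2), VP (2)
print(derivation_bits(["S", "NP", "VP"]))  # 0 + 1 + 1 = 2.0 bits
```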
Learning Phrase Structure Grammars
Data:
John screamed
John died
Mary screamed
Encoding in My Model
1010100111010100101101010001100111100011010110
(Diagram: the bit string above, together with the symbol frequencies and rule frequencies below, is fed to a decoder, which outputs the grammar and the data.)

Grammar:
1  S → NP VP
2  NP → john
3  NP → mary
4  VP → screamed
5  VP → died

Data:
John screamed
John died
Mary screamed

Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Rule frequencies: rule 1: 3, rule 2: 2, rule 3: 1, rule 4: 2, rule 5: 1

Number of bits decoded = evaluation
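The slides do not spell out the exact coding scheme, so the following is a simplified sketch in the same spirit: the grammar is charged -log2 of each symbol's relative frequency (with null separators between rules), and the data is charged -log2 of each rule's relative frequency among the rules sharing its left-hand side. The relative-frequency coding is my assumption:

```python
import math
from collections import Counter

# Rules as number -> (LHS, RHS), with usage counts from the slide
rules = {1: ("S", ["NP", "VP"]), 2: ("NP", ["john"]), 3: ("NP", ["mary"]),
         4: ("VP", ["screamed"]), 5: ("VP", ["died"])}
rule_freq = {1: 3, 2: 2, 3: 1, 4: 2, 5: 1}

def grammar_bits():
    """Each symbol token costs -log2 of its relative frequency in the
    grammar; 'null' separates the five rules (hence frequency 4)."""
    symbols = Counter()
    for lhs, rhs in rules.values():
        symbols.update([lhs, *rhs])
    symbols["null"] = len(rules) - 1
    total = sum(symbols.values())
    return sum(c * -math.log2(c / total) for c in symbols.values())

def data_bits():
    """Each rule use costs -log2 of its relative frequency among the
    rules that share its left-hand side."""
    lhs_totals = Counter()
    for num, (lhs, _) in rules.items():
        lhs_totals[lhs] += rule_freq[num]
    return sum(freq * -math.log2(freq / lhs_totals[rules[num][0]])
               for num, freq in rule_freq.items())

print(grammar_bits() + data_bits())  # overall evaluation in bits
```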
Example: English

Data:
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary

Learned Grammar:
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
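As a quick sanity check (my illustration using the NLTK library; not part of the talk), the learned grammar can be written as a CFG and shown to parse even the recursive training sentences:

```python
import nltk

learned = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> 'ran' | 'screamed' | Vt NP | Vs S
    Vt -> 'hit' | 'kicked'
    Vs -> 'thinks' | 'hopes'
    NP -> 'John' | 'Ethel' | 'Mary' | 'Noam'
""")

parser = nltk.ChartParser(learned)
for tree in parser.parse("John hopes Ethel thinks Mary hit Ethel".split()):
    tree.pretty_print()
```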
Real Language Data
Can the MDL metric also learn grammars from corpora of unrestricted natural language?
If it could, we’d largely have finished syntax
But search space is way too big
We need to simplify the task in some way: learn only verb subcategorization classes
Switchboard Corpus

( (S
    (CC and)
    (PRN
      (, ,)
      (S
        (NP-SBJ (PRP you))
        (VP (VBP know)))
      (, ,))
    (NP-SBJ-1 (PRP she))
    (VP (VBD spent)
      (NP
        (NP (CD nine) (NNS months))
        (PP (IN out)
          (PP (IN of)
            (NP (DT the) (NN year)))))
      (S-ADV
        (NP-SBJ (-NONE- *-1))
        (ADVP (RB just))
        (VP (VBG visiting)
          (NP (PRP$ her) (NNS children)))))
    (. .) (-DFL- E_S)))
Extracted Information:
Verb: spent
Subcategorization frame: * NP S
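A sketch of the extraction step (my reconstruction; the actual preprocessing is not shown in the talk): for each VBD verb, collect the labels of the constituents following it inside its VP, stripping modifiers such as -SBJ or -ADV. The leading * in the frame above presumably marks the subject position, which this sketch omits:

```python
import nltk

def extract_frames(treebank_string):
    """Yield (verb, frame) pairs for past-tense (VBD) verbs: the frame
    lists the labels of the verb's following sisters within its VP,
    with modifiers stripped (NP-SBJ -> NP, S-ADV -> S)."""
    tree = nltk.Tree.fromstring(treebank_string)
    for vp in tree.subtrees(lambda t: t.label() == "VP"):
        for i, child in enumerate(vp):
            if isinstance(child, nltk.Tree) and child.label() == "VBD":
                frame = [c.label().split("-")[0].split("=")[0]
                         for c in list(vp)[i + 1:] if isinstance(c, nltk.Tree)]
                yield child[0], frame

# On the Switchboard tree above this yields ('spent', ['NP', 'S'])
```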
Extracted Data
Only verbs tagged as VBD (past tense) extracted
Modifiers to basic labels ignored
21,759 training instances
704 different verbs
706 distinct subcategorization frames
25 different types of constituent appeared alongside the verbs (e.g. S, SBAR, NP, ADVP)
Verb Class Grammars

S → Class1 Subcat1
S → Class1 Subcat2
S → Class2 Subcat1
Class1 → grew
Class1 → ended
Class2 → did

grew and ended can appear with subcats 1 and 2; did only with subcat 1
Grouping together verbs with similar subcategorizations should improve the evaluation
A New Search Mechanism
We need a search mechanism that will only produce candidate grammars of the right form
• Start with all verbs in one class
• Move a randomly chosen verb to a new class (P=0.5) or a different existing class (P=0.5)
• Empty verb classes are deleted
• Redundant rules are removed
A New Search Mechanism (2)
Annealing search:
• After no changes are accepted for 2,000 iterations, switch to the merging phase
• Merge two randomly selected classes
• After no changes are accepted for 2,000 iterations, switch back to the moving phase
• Stop after no changes have been accepted for 20,000 iterations
• Multiple runs were conducted and the grammar with the overall lowest evaluation was selected
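In outline, the moving phase might look like the sketch below (my code; I assume a change is accepted only if it lowers the MDL evaluation, and the merging phase is analogous, with the move proposal replaced by a merge of two random classes):

```python
import random

def moving_phase(classes, evaluate, patience=2000):
    """Propose moving one randomly chosen verb to a new class (P=0.5)
    or an existing class (P=0.5); accept only improvements."""
    best = evaluate(classes)
    since_accept = 0
    while since_accept < patience:
        verb = random.choice([v for cls in classes for v in cls])
        proposal = [cls - {verb} for cls in classes]
        proposal = [cls for cls in proposal if cls]  # delete empty classes
        if random.random() < 0.5 or not proposal:
            proposal.append({verb})            # move to a brand-new class
        else:
            random.choice(proposal).add(verb)  # move to an existing class
        score = evaluate(proposal)
        if score < best:
            classes, best, since_accept = proposal, score, 0
        else:
            since_accept += 1
    return classes, best
```

Here `classes` is a list of sets of verbs and `evaluate` returns the description length in bits of the verb-class grammar those classes induce.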
Grammar Evaluations
                      Best learned   Each verb in a    One verb
                      grammar        separate class    class
Data                  207,312.4      187,026.7         220,520.4
Grammar                37,885.5      111,036.5          29,915.1
Overall Evaluation    245,198.0      298,063.3         250,435.5

(all values in bits)
Learned Classes

Class 1: thought, vowed, prayed, decided, adjusted, wondered, wished, allowed, knew, suggested, claimed, believed, remarked, resented, detailed, misunderstood, assumed, competed, snowballed, smoked, said, struggled, determined, noted, understood, foresaw, expected, discovered, realized, negotiated, suspected, indicated
Description: Usually take an S or SBAR complement (SBAR usually contains that or who etc. followed by an S)

Class 2: enjoyed, canceled, liked, had, finished, traded, sold, ruined, needed, watched, loved, included, received, converted, rented, bred, deterred, increased, encouraged, made, swapped, shot, offered, spent, impressed, discussed, missed, carried, injured, presented, surprised…
Description: Usually take an NP argument (often in conjunction with other arguments)

Class 3: did
Description: did only

Class 4: all other verbs
Description: miscellaneous

Class 5: used, named, tried, considered, tended, refused, wanted, managed, let, forced, began, appeared
Description: Typically take an S argument (but never just an SBAR)

Class 6: wound, grew, ended, closed, backed
Description: Usually take a particle
Did MDL make appropriate generalizations?
The learned verb classes are clearly linguistically coherent
But they don’t account for exactly which verbs can appear with which subcats
Linguists have proposed far more fine-grained classes
The data available for learning was limited (subcats had no internal structure, and the Penn Treebank labels may not be sufficient)
But linguists can’t explain which verbs appear with which subcats either
Conclusions
• MDL (and only MDL) can determine when to make linguistic generalizations and when not to
• The same MDL metric can be used both on small sets of example sentences and on unrestricted corpora
• Work using corpora does not address the kind of issues that syntacticians are interested in