Unsupervised Learning of Natural Language Morphology using MDL
John Goldsmith
November 9, 2001
Unsupervised learning
Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words.
But no tagging or text preparation.
Output
List of stems, suffixes, and prefixes
List of signatures.
A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem.
Hence, a stem in a corpus has a unique signature.
A signature has a unique set of stems associated with it …
(example of signature in English)
NULL.ed.ing.s
ask call point
=
ask asked asking asks
call called calling calls
point pointed pointing points
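The signature example above can be read as a small cross product of stems and suffixes. The following is a minimal sketch of that reading; the function name `expand_signature` and the convention that `NULL` stands for the empty suffix are assumptions made for illustration, not part of the original slides.

```python
def expand_signature(signature, stems):
    """Expand a dot-separated signature into the words it licenses.

    Assumption: "NULL" denotes the empty suffix.
    """
    suffixes = ["" if s == "NULL" else s for s in signature.split(".")]
    # Every stem combines with every suffix in the signature.
    return {stem: [stem + suffix for suffix in suffixes] for stem in stems}


words = expand_signature("NULL.ed.ing.s", ["ask", "call", "point"])
for stem, forms in words.items():
    print(stem, "->", " ".join(forms))
```

Three stems and four suffixes are enough to describe twelve word forms, which is the compression the rest of the talk builds on.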
Minimum Description Length (MDL)
Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989)
Work by Michael Brent and Carl de Marcken on word-discovery using MDL
Essence of MDL
We are given
1. a corpus, and
2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.
(“Given”? Given by who? We’ll get back to that.)
(Remember: a distribution is a set of non-negative numbers summing to 1.0.)
The higher the probability is that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.
Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data.
This is known as the optimal compressed length of the data, given the model.
Using base 2 logs, this number is a measure in information theoretic bits.
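To make the "compressed length" concrete, here is a minimal sketch that computes -1 * log2 probability of a tiny corpus. It assumes, purely for illustration, that the model assigns an independent probability to each word token; the slides' probabilistic morphology is richer than this, and the names `compressed_length_bits`, `tokens`, and `model` are hypothetical.

```python
import math


def compressed_length_bits(tokens, prob):
    """Return -log2 P(corpus), assuming tokens are independent draws.

    `prob` maps each word to its probability under the model
    (a stand-in for the probabilistic morphology).
    """
    return -sum(math.log2(prob[token]) for token in tokens)


# Hypothetical toy corpus and distribution (non-negative, sums to 1.0).
tokens = ["the", "dog", "the", "dogs"]
model = {"the": 0.5, "dog": 0.25, "dogs": 0.25}

bits = compressed_length_bits(tokens, model)
print(bits)
```

A model that gives the observed words higher probability yields a smaller number of bits.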
Essence of MDL…
The goodness of the morphology is also measured by how compact the morphology is.
We can measure the compactness of a morphology in information theoretic bits.
How can we measure the compactness of a morphology?
Let’s consider a naïve version of description length: count the number of letters.
This naïve version is nonetheless helpful in seeing the intuition involved.
Naive Minimum Description Length
Corpus:
jump, jumps, jumping
laugh, laughed, laughing
sing, sang, singing
the, dog, dogs
total: 61 letters
Analysis:
Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)
total: 29 letters.
Notice that the description length goes UP if we analyze sing into s+ing
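The naive letter count is easy to reproduce. The sketch below counts the letters of the words exactly as listed on the slide; doing so gives 61 letters for the raw corpus and 29 for the analysis. The variable and function names are illustrative only.

```python
def letters(items):
    """Total number of letters in a list of strings."""
    return sum(len(item) for item in items)


corpus = [
    "jump", "jumps", "jumping",
    "laugh", "laughed", "laughing",
    "sing", "sang", "singing",
    "the", "dog", "dogs",
]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

raw_total = letters(corpus)
analysis_total = letters(stems) + letters(suffixes) + letters(unanalyzed)
print(raw_total, analysis_total)
```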
Essence of MDL…
The best overall theory of a corpus is the one for which the sum of
-log prob (corpus) + length of the morphology
(that’s the description length) is the smallest.
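As a rough sketch of this trade-off, the snippet below adds the two terms for three candidate theories and picks the smallest sum. All the bit counts are made-up numbers chosen only to illustrate the comparison, and the names `description_length` and `candidates` are hypothetical.

```python
def description_length(neg_log2_prob_corpus, morphology_bits):
    """Description length = compressed corpus length + morphology length (bits)."""
    return neg_log2_prob_corpus + morphology_bits


# Hypothetical (compressed corpus bits, morphology bits) per candidate theory.
candidates = {
    "elegant but fits badly": (600_000.0, 20_000.0),
    "balanced": (350_000.0, 80_000.0),
    "baroque, modeled on data": (300_000.0, 350_000.0),
}

best = min(candidates, key=lambda name: description_length(*candidates[name]))
print(best)
```

Neither the most compact morphology nor the one that fits the corpus most tightly wins on its own; only the sum matters.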
Essence of MDL…
[Figure: bar chart, vertical axis 0 to 700,000. Bars: "Best analysis", "Elegant theory that works badly", "Baroque theory modeled on data". Series: "Length of morphology" and "Log prob of corpus".]
Overall logic
Search through morphology space for the morphology which provides the smallest description length.
[Diagram, built up step by step. Boxes: Corpus, Bootstrap heuristic, Morphology, incremental heuristics, modified morphology, Garbage]
Pick a large corpus from a language -- 5,000 to 1,000,000 words.
Feed it into the “bootstrapping” heuristic...
Out of which comes a preliminary morphology, which need not be superb.
Feed it to the incremental heuristics...
Out comes a modified morphology.
Is the modification an improvement? Ask MDL!
If it is an improvement, replace the morphology...
Send it back to the incremental heuristics again...
Continue until there are no improvements to try.
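The loop described above can be sketched as a simple greedy search. This is only an outline under stated assumptions: `bootstrap`, the heuristics, and `description_length` are placeholders supplied by the caller, and the toy stand-ins at the bottom (an integer "morphology" and a quadratic cost) exist only so the loop can be run.

```python
def mdl_search(corpus, bootstrap, incremental_heuristics, description_length):
    """Greedy search: keep a modified morphology only if MDL says it is better."""
    morphology = bootstrap(corpus)                 # preliminary morphology
    best = description_length(morphology, corpus)
    improved = True
    while improved:                                # stop when nothing helps
        improved = False
        for heuristic in incremental_heuristics:
            candidate = heuristic(morphology, corpus)
            cost = description_length(candidate, corpus)
            if cost < best:                        # "Ask MDL!"
                morphology, best = candidate, cost
                improved = True
    return morphology, best


# Toy stand-ins (hypothetical): the "morphology" is just an integer.
result = mdl_search(
    corpus=None,
    bootstrap=lambda corpus: 10,
    incremental_heuristics=[lambda m, corpus: m - 1],
    description_length=lambda m, corpus: (m - 3) ** 2,
)
print(result)
```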
1. Bootstrap heuristic
A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.
In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least $\prod_{i=1}^{V} |w_i|$ members.
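To get a feel for how quickly this lower bound grows, the sketch below multiplies the word lengths of a small word list. It assumes V is the number of distinct words; the function name `search_space_lower_bound` is made up for this example.

```python
import math


def search_space_lower_bound(vocabulary):
    """Product of word lengths: a lower bound on the number of joint analyses."""
    return math.prod(len(word) for word in vocabulary)


small = search_space_lower_bound(["jump", "jumps", "dog"])
larger = search_space_lower_bound(
    ["jump", "jumps", "jumping", "laugh", "laughed", "laughing"]
)
print(small, larger)
```

Even six short words give tens of thousands of combinations, which is why heuristics are needed rather than exhaustive search.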
Better bootstrap heuristics
Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955):
Current variant:
Cut words at certain peaks of successor frequency.
Problems: can over-cut; can under-cut;
Successor frequency
g o v e r n
Empirically, only one letter follows “gover”: “n”
g o v e r n m
Empirically, 6 letters follow “govern”: “m”, “i”, “o”, “s”, “e”, “#”
g o v e r n m e
Empirically, 1 letter follows “governm”: “e”
g o v e r 1 n 6 m 1 e
peak of successor frequency
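Successor frequency can be computed directly from a word list. The sketch below is a minimal version, assuming `#` marks the end of a word; the six-word list is a hypothetical stand-in for a real corpus, chosen so that the counts around "govern" match the ones shown above.

```python
from collections import defaultdict


def successor_sets(words, end="#"):
    """Map each word-initial string to the set of symbols that follow it."""
    followers = defaultdict(set)
    for word in words:
        marked = word + end          # assumption: '#' marks the word end
        for i in range(1, len(marked)):
            followers[marked[:i]].add(marked[i])
    return followers


def successor_frequencies(word, followers):
    """Successor frequency after each prefix of `word`, left to right."""
    return [len(followers[word[:i]]) for i in range(1, len(word) + 1)]


# Hypothetical word list.
words = ["govern", "governs", "governed", "governing", "governor", "government"]
followers = successor_sets(words)
profile = successor_frequencies("government", followers)
print(profile)
```

The profile rises to 6 after "govern" and is 1 on both sides, which is the kind of clear peak the bootstrap heuristic looks for.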
Lots of errors…
c 9 o 18 n 11 s 6 e 4 r 1 v 2 a 1 t 1 i 2 v 1 e 1 s
Peaks: co-nservatives (wrong), conserv-atives (right), conservati-ves (wrong)
Even so…
We set conditions:
Accept cuts with stems at least 5 letters in length;
Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment)
Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems.
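The last two conditions can be sketched as a filter over proposed cuts. The code below assumes the cuts have already passed the peak test (that part is not implemented here) and arrive as (stem, suffix) pairs; the function name, the parameter names, and the sample cuts are hypothetical.

```python
from collections import defaultdict


def build_signatures(cuts, min_stem_len=5, min_stems=5):
    """Group stems by their suffix set and keep well-supported signatures."""
    suffixes_of = defaultdict(set)
    for stem, suffix in cuts:
        if len(stem) >= min_stem_len:              # stems of at least 5 letters
            suffixes_of[stem].add(suffix or "NULL")
    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        stems_of[".".join(sorted(suffixes))].add(stem)
    # Keep only signatures shared by enough stems.
    return {sig: stems for sig, stems in stems_of.items() if len(stems) >= min_stems}


# Hypothetical cuts; min_stems is lowered to 2 so the toy data passes.
cuts = [
    ("govern", ""), ("govern", "ed"), ("govern", "s"),
    ("launch", ""), ("launch", "ed"), ("launch", "s"),
    ("ask", ""), ("ask", "ed"),        # stem too short, rejected
    ("visit", "ing"),                  # signature with a single stem, rejected
]
signatures = build_signatures(cuts, min_stems=2)
print(signatures)
```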
2. Incremental heuristics
Coarse-grained to fine-grained
1. Stems and suffixes to split: Accept any analysis of a word if it consists of a known stem and a known suffix.
2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.
Incremental heuristic
3. Slide stem-suffix boundary to the left: Again, use MDL to decide.
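The first of these incremental heuristics (accept a word that splits into a known stem and a known suffix) is simple enough to sketch directly. The function name `split_known` and the sample stem and suffix sets are assumptions for illustration; the MDL-based heuristics 2 and 3 are not covered by this snippet.

```python
def split_known(word, stems, suffixes):
    """Return every (stem, suffix) split where both parts are already known."""
    analyses = []
    for i in range(1, len(word)):          # non-empty stem and non-empty suffix
        stem, suffix = word[:i], word[i:]
        if stem in stems and suffix in suffixes:
            analyses.append((stem, suffix))
    return analyses


known_stems = {"jump", "laugh"}
known_suffixes = {"ed", "ing", "s"}

print(split_known("jumped", known_stems, known_suffixes))
print(split_known("singing", known_stems, known_suffixes))
```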
How do we use MDL to decide?
Using MDL to judge a potential stem
act, acted, action, acts, acting.
We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s
Let’s compute cost versus savings of signature NULL.ed.ion.ing.s
Savings:
Stem savings: 4 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.
Cost of NULL.ed.ion.ing.s
A pointer to each suffix:
$\log \frac{[W]}{[\text{NULL}]}$, $\log \frac{[W]}{[\text{ed}]}$, $\log \frac{[W]}{[\text{ion}]}$, $\log \frac{[W]}{[\text{ing}]}$, $\log \frac{[W]}{[\text{s}]}$
To give a feel for this: $\log \frac{[W]}{[\text{ed}]} \approx 5$
Total cost of suffix list: about 30 bits.
Cost of pointer to signature: total cost is $\log \frac{[W]}{[\#\,\text{stems that use this sig}]} \approx 13$ bits
Cost of signature: about 45 bits
Savings: about 60 bits
so MDL says: Do it! Analyze the words as stem + suffix.
Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
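The cost-versus-savings arithmetic can be sketched as follows. Everything numeric here is an assumption: [W] is taken to be a total word count of 8192, the suffix counts are invented so that the pointer costs land near the slide's figures (about 5 bits for "ed", about 30 bits for the suffix list, 13 bits for the signature pointer), and a letter is priced at log2(26) bits.

```python
import math

BITS_PER_LETTER = math.log2(26)   # assumption: about 4.7 bits per letter


def pointer_cost(total, count):
    """Cost in bits of a pointer to an item seen `count` times out of `total`."""
    return math.log2(total / count)


# Hypothetical counts.
total_words = 8192
suffix_counts = {"NULL": 256, "ed": 256, "ion": 32, "ing": 128, "s": 128}
stems_using_signature = 1         # only "act" uses the new signature

suffix_list_cost = sum(pointer_cost(total_words, c) for c in suffix_counts.values())
signature_pointer_cost = pointer_cost(total_words, stems_using_signature)
cost = suffix_list_cost + signature_pointer_cost

# act, acted, action, acts, acting: the stem is stored once instead of 5 times.
savings = len("act") * 4 * BITS_PER_LETTER

print(round(cost, 1), round(savings, 1), savings > cost)
```

With these invented counts the cost is 43 bits against savings of roughly 56 bits, so the analysis is accepted. A suffix that did not already exist would have to be spelled out in the suffix list, adding to the cost side.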
Repair heuristics: using MDL
We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths.
[Diagram: Original morphology + compressed data <> Revised morphology + compressed data]