Unsupervised Learning of Natural Language Morphology using MDL

John Goldsmith, November 9, 2001

Page 1: Unsupervised Learning of Natural Language Morphology using MDL

Unsupervised Learning of Natural Language Morphology using MDL

John Goldsmith

November 9, 2001

Page 2: Unsupervised Learning of Natural Language Morphology using MDL

Unsupervised learning

Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words.

But no tagging or text preparation.

Page 3: Unsupervised Learning of Natural Language Morphology using MDL

Output

List of stems, suffixes, and prefixes.

List of signatures.

A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem.

Hence, a stem in a corpus has a unique signature.

A signature has a unique set of stems associated with it.

…

Page 4: Unsupervised Learning of Natural Language Morphology using MDL

(example of signature in English)

NULL.ed.ing.s

ask call point

=

ask asked asking asks

call called calling calls

point pointed pointing points
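A minimal sketch of how signatures could be read off a list of stem + suffix analyses, assuming the analyses are already available as pairs. The function name `signatures` and the dot-joined naming of a signature are illustrative choices that follow the notation on this slide, not the author's code.

```python
from collections import defaultdict


def signatures(analyses):
    """Group stems by the set of suffixes they occur with.

    `analyses` is an iterable of (stem, suffix) pairs; the empty
    suffix is written "NULL", as on the slide.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)

    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        # Name the signature by its sorted suffixes joined with dots.
        stems_of[".".join(sorted(suffixes))].add(stem)
    return dict(stems_of)


pairs = [(stem, suffix)
         for stem in ("ask", "call", "point")
         for suffix in ("NULL", "ed", "ing", "s")]
sigs = signatures(pairs)
```

Each stem ends up under exactly one signature, and each signature maps to its own set of stems.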

Page 5: Unsupervised Learning of Natural Language Morphology using MDL

Minimum Description Length (MDL)

Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989)

Work by Michael Brent and Carl de Marcken on word-discovery using MDL

Page 6: Unsupervised Learning of Natural Language Morphology using MDL

Essence of MDL

We are given

1. a corpus, and

2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.

(“Given”? Given by who? We’ll get back to that.)

(Remember: a distribution is a set of non-negative numbers summing to 1.0.)

Page 7: Unsupervised Learning of Natural Language Morphology using MDL

The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.

Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data.

This is known as the optimal compressed length of the data, given the model.

Using base 2 logs, this number is a measure in information theoretic bits.
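To make the compressed length concrete, here is a small sketch that computes -log2 of the probability a model assigns to a corpus. It assumes, purely for illustration, that the model is a plain distribution over whole words and that tokens are independent; the morphology in the talk assigns probabilities through stems and affixes instead. The two toy models and the four-word corpus are made up.

```python
import math


def compressed_length_bits(tokens, prob):
    """Return -log2 prob(corpus), treating tokens as independent draws
    from `prob` (a simplifying assumption for this sketch)."""
    return -sum(math.log2(prob[token]) for token in tokens)


# Hypothetical toy models: two distributions over the same three words.
good_model = {"the": 0.5, "dog": 0.25, "dogs": 0.25}
poor_model = {"the": 0.1, "dog": 0.1, "dogs": 0.8}
corpus = ["the", "dog", "the", "dogs"]

good_bits = compressed_length_bits(corpus, good_model)
poor_bits = compressed_length_bits(corpus, poor_model)
```

The model that gives the observed corpus the higher probability yields the smaller number of bits.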

Page 8: Unsupervised Learning of Natural Language Morphology using MDL

Essence of MDL…

The goodness of the morphology is also measured by how compact the morphology is.

We can measure the compactness of a morphology in information theoretic bits.

Page 9: Unsupervised Learning of Natural Language Morphology using MDL

How can we measure the compactness of a morphology?

Let’s consider a naïve version of description length: count the number of letters.

This naïve version is nonetheless helpful in seeing the intuition involved.

Page 10: Unsupervised Learning of Natural Language Morphology using MDL

Naive Minimum Description Length

Corpus:

jump, jumps, jumping

laugh, laughed, laughing

sing, sang, singing

the, dog, dogs

total: 61 letters

Analysis:

Stems: jump laugh sing sang dog (20 letters)

Suffixes: s ing ed (6 letters)

Unanalyzed: the (3 letters)

total: 29 letters.

Notice that the description length goes UP if we analyze sing into s+ing.
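The naive letter count is easy to reproduce. The sketch below counts the letters in the raw word list and in the stem/suffix analysis from this slide; the helper name `letters` is an illustrative choice.

```python
def letters(items):
    """Naive description length: total number of letters in a list of strings."""
    return sum(len(item) for item in items)


corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

raw_length = letters(corpus)
analysis_length = letters(stems) + letters(suffixes) + letters(unanalyzed)
```

Comparing `raw_length` with `analysis_length` shows how much the analysis saves under this crude measure.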

Page 11: Unsupervised Learning of Natural Language Morphology using MDL

Essence of MDL…

The best overall theory of a corpus is the one for which the sum of

-log prob (corpus) + length of the morphology

(that’s the description length) is the smallest.

Page 12: Unsupervised Learning of Natural Language Morphology using MDL

Essence of MDL…

[Bar chart, vertical axis from 0 to 700,000, comparing three analyses: “Best analysis”, “Elegant theory that works badly”, and “Baroque theory modeled on data”. For each analysis the chart shows two quantities: length of morphology and log prob of corpus.]
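The trade-off between the two quantities can be sketched as a simple minimization. The bit counts below are hypothetical, chosen only to illustrate the three cases named on this slide; they are not read off the chart.

```python
from typing import NamedTuple


class Theory(NamedTuple):
    name: str
    morphology_bits: float  # length of the morphology
    corpus_bits: float      # -log2 prob(corpus) under that morphology


def description_length(theory):
    """Description length = length of morphology + compressed length of corpus."""
    return theory.morphology_bits + theory.corpus_bits


# Hypothetical numbers, for illustration only.
candidates = [
    Theory("best analysis", 150_000, 300_000),
    Theory("elegant theory that works badly", 50_000, 600_000),
    Theory("baroque theory modeled on data", 450_000, 200_000),
]

best = min(candidates, key=description_length)
```

A very small morphology pays for itself with a poorly compressed corpus, and a very detailed one compresses the corpus well but is itself long; the minimum of the sum sits between the two.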

Page 13: Unsupervised Learning of Natural Language Morphology using MDL

Overall logic

Search through morphology space for the morphology which provides the smallest description length.

Page 14: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus]

Pick a large corpus from a language: 5,000 to 1,000,000 words.

Page 15: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus, Bootstrap heuristic]

Feed it into the “bootstrapping” heuristic...

Page 16: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus, Bootstrap heuristic, Morphology]

Out of which comes a preliminary morphology, which need not be superb.

Page 17: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus, Morphology, Bootstrap heuristic, incremental heuristics]

Feed it to the incremental heuristics...

Page 18: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus, Morphology, Bootstrap heuristic, incremental heuristics, modified morphology]

Out comes a modified morphology.

Page 19: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus, Morphology, Bootstrap heuristic, incremental heuristics, modified morphology]

Is the modification an improvement? Ask MDL!

Page 20: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus, Morphology, Bootstrap heuristic, modified morphology, Garbage]

If it is an improvement, replace the morphology...

Page 21: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Corpus, Bootstrap heuristic, incremental heuristics, modified morphology]

Send it back to the incremental heuristics again...

Page 22: Unsupervised Learning of Natural Language Morphology using MDL

[Diagram: Morphology, incremental heuristics, modified morphology]

Continue until there are no improvements to try.
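The loop walked through on the preceding slides can be sketched as a greedy search. This is only an outline under stated assumptions: `bootstrap`, `incremental_heuristics`, and `description_length` are hypothetical caller-supplied functions, and the toy stand-ins at the bottom exist only to show the control flow.

```python
def mdl_search(corpus, bootstrap, incremental_heuristics, description_length):
    """Greedy search: keep a proposed modification only if it shortens
    the description length; stop when no heuristic improves it."""
    morphology = bootstrap(corpus)
    best = description_length(morphology, corpus)
    improved = True
    while improved:
        improved = False
        for propose in incremental_heuristics:
            candidate = propose(morphology, corpus)
            if candidate is None:
                continue
            length = description_length(candidate, corpus)
            if length < best:
                # An improvement: the modified morphology replaces the old one.
                morphology, best = candidate, length
                improved = True
            # Otherwise the candidate is discarded.
    return morphology, best


# Toy stand-ins (hypothetical): a "morphology" is just an integer and the
# description length is smallest at 3.
toy_result, toy_length = mdl_search(
    corpus=None,
    bootstrap=lambda corpus: 10,
    incremental_heuristics=[lambda m, corpus: m - 1],
    description_length=lambda m, corpus: (m - 3) ** 2,
)
```

Because only improvements are accepted, the loop can stop at a local minimum; the quality of the heuristics determines how good that minimum is.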

Page 23: Unsupervised Learning of Natural Language Morphology using MDL

1. Bootstrap heuristic

A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.

In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least $\prod_{i=1}^{V} |w_i|$ members.
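To get a feel for how quickly this lower bound grows, a short sketch that multiplies the word lengths of a small list; the function name is an illustrative choice.

```python
import math


def search_space_lower_bound(words):
    """Product of the word lengths: each word w has at least |w| analyses."""
    return math.prod(len(word) for word in words)


small = search_space_lower_bound(["jump", "jumps", "the"])
larger = search_space_lower_bound(["jump", "jumps", "jumping",
                                   "laugh", "laughed", "laughing"])
```

Even a handful of words gives thousands of combinations, so exhaustive search over a real vocabulary is out of the question.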

Page 24: Unsupervised Learning of Natural Language Morphology using MDL

Better bootstrap heuristics

Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955):

Current variant:

Cut words at certain peaks of successor frequency.

Problems: can over-cut; can under-cut.

Page 25: Unsupervised Learning of Natural Language Morphology using MDL

Successor frequency

g o v e r n

Empirically, only one letter follows “gover”: “n”

Page 26: Unsupervised Learning of Natural Language Morphology using MDL

Successor frequency

g o v e r n m

Empirically, 6 letters follow “govern”: “m”, “i”, “o”, “s”, “e”, and “#” (end of word).

Page 27: Unsupervised Learning of Natural Language Morphology using MDL

Successor frequency

g o v e r n m e

Empirically, 1 letter follows “governm”: “e”

g o v e r 1 n 6 m 1 e

Peak of successor frequency: 6, after “govern”.
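Successor frequency can be computed directly from a word list. The sketch below counts, for every prefix of a word, how many distinct symbols follow that prefix anywhere in the list, with “#” standing for end of word. The six-word list is a made-up sample chosen so that the counts around “govern” match the slides; a real corpus would supply its own.

```python
from collections import defaultdict


def successor_frequencies(words, word):
    """Return [(prefix, number of distinct symbols following it)] for each
    non-empty prefix of `word`; "#" marks the end of a word."""
    followers = defaultdict(set)
    for w in words:
        for i in range(len(w) + 1):
            followers[w[:i]].add(w[i] if i < len(w) else "#")
    return [(word[:i], len(followers[word[:i]]))
            for i in range(1, len(word) + 1)]


# Hypothetical sample word list.
sample = ["govern", "governs", "governed", "governing",
          "governor", "government"]
freqs = dict(successor_frequencies(sample, "government"))
```

In this sample, `freqs` has a single value above 1, at the prefix “govern”, which is where the heuristic would cut.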

Page 28: Unsupervised Learning of Natural Language Morphology using MDL

Lots of errors…

c o n s e r v a t i v e s

Successor frequency after each letter: 9 18 11 6 4 1 2 1 1 2 1 1

Peaks: after “co” (18): wrong; after “conserv” (2): right; after “conservati” (2): wrong.

Page 29: Unsupervised Learning of Natural Language Morphology using MDL

Even so…

We set conditions:

Accept cuts with stems at least 5 letters in length;

Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment)

Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems.
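The three conditions can be combined into one small function. This is a sketch under assumptions, not the author's implementation: a “clear peak” is taken to mean a successor frequency of exactly 1 on either side of a value greater than 1, a stem that is itself a word gets the suffix NULL, and the thresholds are parameters so that the toy run below can lower the minimum number of stems.

```python
from collections import defaultdict


def bootstrap_signatures(words, min_stem_len=5, min_stems=5):
    """Cut words at clear successor-frequency peaks and collect signatures."""
    words = set(words)
    followers = defaultdict(set)
    for w in words:
        for i in range(len(w) + 1):
            followers[w[:i]].add(w[i] if i < len(w) else "#")

    def sf(prefix):
        return len(followers.get(prefix, ()))

    suffixes_of = defaultdict(set)
    for w in words:
        # i is the stem length; the suffix w[i:] is never empty here.
        for i in range(min_stem_len, len(w)):
            # Clear peak: 1 ... N ... 1 around the cut.
            if sf(w[:i - 1]) == 1 and sf(w[:i]) > 1 and sf(w[:i + 1]) == 1:
                suffixes_of[w[:i]].add(w[i:])

    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        if stem in words:
            suffixes.add("NULL")  # assumption: the bare stem counts as NULL
        stems_of[".".join(sorted(suffixes))].add(stem)

    # Keep only signatures supported by enough stems.
    return {sig: stems for sig, stems in stems_of.items()
            if len(stems) >= min_stems}


# Hypothetical toy word list; min_stems is lowered because it is so small.
toy_words = ["laugh", "laughs", "laughed", "laughing",
             "point", "points", "pointed", "pointing"]
toy_signatures = bootstrap_signatures(toy_words, min_stems=2)
```

With the default thresholds the same toy list returns nothing, since no signature there has five stems; the conditions are deliberately conservative.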

Page 30: Unsupervised Learning of Natural Language Morphology using MDL

2. Incremental heuristics

Coarse-grained to fine-grained

1. Stems and suffixes to split: Accept any analysis of a word if it consists of a known stem and a known suffix.

2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.
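The first two incremental heuristics can be sketched as two small functions. Both names are illustrative, and the second makes an interpretive assumption: the “apparent suffixes” of a candidate stem are taken to be the known suffixes it occurs with, plus NULL when the stem is itself a word. The MDL test that decides whether to accept a candidate is not included here.

```python
def split_known(word, stems, suffixes):
    """Heuristic 1: every split of `word` into a known stem + known suffix."""
    return [(word[:i], word[i:])
            for i in range(1, len(word))
            if word[:i] in stems and word[i:] in suffixes]


def loose_fit_candidates(words, suffixes):
    """Heuristic 2: map each string that precedes a known suffix to the
    suffixes it appears with (candidates only; MDL decides later)."""
    words = set(words)
    candidates = {}
    for w in words:
        for suffix in suffixes:
            if w.endswith(suffix) and len(w) > len(suffix):
                candidates.setdefault(w[:-len(suffix)], set()).add(suffix)
    for stem, found in candidates.items():
        if stem in words:
            found.add("NULL")
    return candidates


known_split = split_known("asked", {"ask", "call"}, {"ed", "ing", "s"})
act_candidates = loose_fit_candidates(
    ["act", "acted", "action", "acts", "acting"],
    {"ed", "ion", "ing", "s"},
)
```

The second function deliberately over-generates; filtering is left to the cost-versus-savings comparison.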

Page 31: Unsupervised Learning of Natural Language Morphology using MDL

Incremental heuristic

3. Slide stem-suffix boundary to the left: Again, use MDL to decide.

How do we use MDL to decide?

Page 32: Unsupervised Learning of Natural Language Morphology using MDL

Using MDL to judge a potential stem

act, acted, action, acts, acting.

We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s

Let’s compute cost versus savings of signature NULL.ed.ion.ing.s

Savings:

Stem savings: 4 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.

Page 33: Unsupervised Learning of Natural Language Morphology using MDL

Cost of NULL.ed.ion.ing.s

A pointer to each suffix:

$$\log\frac{W}{[\text{NULL}]} + \log\frac{W}{[\text{ed}]} + \log\frac{W}{[\text{ion}]} + \log\frac{W}{[\text{s}]} + \log\frac{W}{[\text{ing}]}$$

To give a feel for this: $\log\frac{W}{[\text{ed}]} \approx 5$

Total cost of suffix list: about 30 bits.

Cost of pointer to signature: total cost is $\log\frac{W}{[\#\,\text{stems that use this sig}]} \approx 13$ bits

Page 34: Unsupervised Learning of Natural Language Morphology using MDL

Cost of signature: about 45 bits

Savings: about 60 bits

so MDL says: Do it! Analyze the words as stem + suffix.

Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
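The cost-versus-savings comparison for act can be sketched numerically. Several things here are assumptions rather than facts from the slides: W is taken to be a total count and [x] the count of x, a letter is costed at log2(26) bits, and all counts are hypothetical values picked so that the pointer costs land near the figures quoted in the talk.

```python
import math

BITS_PER_LETTER = math.log2(26)  # assumption: uniform code over 26 letters


def pointer_bits(total, count):
    """Cost in bits of a pointer to an item: log2(total / count)."""
    return math.log2(total / count)


def signature_tradeoff(stem, suffixes, total, suffix_counts, stems_using_sig):
    """Return (savings, cost) in bits for analyzing stem + each suffix."""
    # Savings: the stem is spelled once instead of once per word.
    savings = len(stem) * (len(suffixes) - 1) * BITS_PER_LETTER
    # Cost: one pointer per suffix plus one pointer to the signature.
    cost = sum(pointer_bits(total, suffix_counts[s]) for s in suffixes)
    cost += pointer_bits(total, stems_using_sig)
    return savings, cost


# Hypothetical counts (not from the slides).
W = 8192
counts = {"NULL": 512, "ed": 256, "ion": 32, "ing": 64, "s": 128}
savings, cost = signature_tradeoff(
    "act", ["NULL", "ed", "ion", "ing", "s"], W, counts, stems_using_sig=1)
accept = savings > cost
```

If a suffix were not yet in the morphology, its letters would have to be paid for as well, which is why the cost would be higher in that case.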

Page 35: Unsupervised Learning of Natural Language Morphology using MDL

Repair heuristics: using MDL

We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths.

[Diagram: (original morphology + compressed data) <> (revised morphology + compressed data)]