unsupervised learning of natural language morphology using mdl

Post on 14-Jan-2016


DESCRIPTION

Unsupervised Learning of Natural Language Morphology using MDL. John Goldsmith November 9, 2001. Unsupervised learning. Input: untagged text in orthographic or phonetic form with spaces (or punctuation) separating words. But no tagging or text preparation. Output. - PowerPoint PPT Presentation

TRANSCRIPT

Unsupervised Learning of Natural Language Morphology using MDL

John Goldsmith

November 9, 2001

Unsupervised learning

Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words.

But no tagging or text preparation.

Output

List of stems, suffixes, and prefixes

List of signatures.

A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem.

Hence, a stem in a corpus has a unique signature.

A signature has a unique set of stems associated with it.

…

(example of signature in English)

NULL.ed.ing.s

ask call point

=

ask asked asking asks

call called calling calls

point pointed pointing points
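The expansion above can be sketched in a few lines of Python. This is only an illustration: the function name `expand` and the convention of writing the empty suffix as `NULL` inside a dot-separated string are assumptions made here, not part of the talk.

```python
def expand(signature, stems):
    """Return every word licensed by a signature and its list of stems.

    `signature` is a dot-separated suffix list such as "NULL.ed.ing.s";
    "NULL" stands for the empty suffix.
    """
    suffixes = ["" if s == "NULL" else s for s in signature.split(".")]
    return [stem + suffix for stem in stems for suffix in suffixes]


words = expand("NULL.ed.ing.s", ["ask", "call", "point"])
```

Three stems and four suffixes give the twelve words listed above, stored as seven short strings.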

Minimum Description Length (MDL)

Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989)

Work by Michael Brent and Carl de Marcken on word-discovery using MDL

Essence of MDL

We are given

1. a corpus, and

2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.

(“Given”? Given by who? We’ll get back to that.)

(Remember: a distribution is a set of non-negative numbers summing to 1.0.)

The higher the probability is that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.

Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data.

This is known as the optimal compressed length of the data, given the model.

Using base 2 logs, this number is a measure in information theoretic bits.
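A minimal sketch of the compressed-length idea follows. It uses a plain distribution over whole words as a stand-in for a probabilistic morphology, and it assumes word tokens are independent; both are simplifications made here for illustration, and the six-word corpus is invented.

```python
import math
from collections import Counter


def compressed_length_bits(corpus, prob):
    """-log2 P(corpus), treating word tokens as independent draws from `prob`."""
    return -sum(math.log2(prob[word]) for word in corpus)


corpus = "the dog jumps the dogs jump".split()
counts = Counter(corpus)

# Two candidate models of the same data (both are proper distributions).
fitted = {w: c / len(corpus) for w, c in counts.items()}   # relative frequencies
uniform = {w: 1 / len(counts) for w in counts}             # ignores frequencies

bits_fitted = compressed_length_bits(corpus, fitted)
bits_uniform = compressed_length_bits(corpus, uniform)
```

The model that assigns the corpus the higher probability (`fitted`) yields the smaller number of bits, which is the sense in which it models the data better.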

Essence of MDL…

The goodness of the morphology is also measured by how compact the morphology is.

We can measure the compactness of a morphology in information theoretic bits.

How can we measure the compactness of a morphology?

Let’s consider a naïve version of description length: count the number of letters.

This naïve version is nonetheless helpful in seeing the intuition involved.

Naive Minimum Description Length

Corpus:

jump, jumps, jumping

laugh, laughed, laughing

sing, sang, singing

the, dog, dogs

total: 61 letters

Analysis:

Stems: jump laugh sing sang dog (20 letters)

Suffixes: s ing ed (6 letters)

Unanalyzed: the (3 letters)

total: 29 letters.

Notice that the description length goes UP if we analyze sing into s+ing
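The letter counts for this slide can be checked mechanically. The sketch below simply counts letters in the lists as given; the helper name `letters` is made up here. A direct count of the twelve corpus words as listed gives 61 letters.

```python
def letters(items):
    """Naive description length: total number of letters in a list of strings."""
    return sum(len(item) for item in items)


corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

corpus_length = letters(corpus)
analysis_length = letters(stems) + letters(suffixes) + letters(unanalyzed)
```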

Essence of MDL…

The best overall theory of a corpus is the one for which the sum of

-log prob (corpus) + length of the morphology

(that’s the description length) is the smallest.
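Written as a formula, using the negative log probability introduced earlier as the compressed length of the data. The symbols are chosen here for illustration and are not taken from the slides: $C$ is the corpus, $M$ a morphology, and $L(M)$ the length of the morphology in bits.

```latex
\[
  \mathrm{DL}(C, M) \;=\; L(M) \;-\; \log_2 P(C \mid M),
  \qquad
  \hat{M} \;=\; \arg\min_{M} \, \mathrm{DL}(C, M)
\]
```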

Essence of MDL…

[Chart: “Length of morphology” and “Log prob of corpus” shown for three analyses: “Best analysis”, “Elegant theory that works badly”, and “Baroque theory modeled on data”. Vertical axis: 0 to 700,000.]

Overall logic

Search through morphology space for the morphology which provides the smallest description length.

[Diagram, built up over several slides: Corpus → Bootstrap heuristic → Morphology → incremental heuristics → modified morphology]

Pick a large corpus from a language: 5,000 to 1,000,000 words.

Feed it into the “bootstrapping” heuristic...

Out of which comes a preliminary morphology, which need not be superb.

Feed it to the incremental heuristics...

Out comes a modified morphology.

Is the modification an improvement? Ask MDL!

If it is an improvement, replace the morphology... (the old morphology goes to “Garbage”).

Send it back to the incremental heuristics again...

Continue until there are no improvements to try.
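The loop described on these slides can be sketched as a greedy search. Everything below the `mdl_search` function is a toy stand-in invented for illustration: a "morphology" is just a set of suffixes, the only heuristic proposes adding one suffix, and the description length is the naive letter count with an assumed minimum stem length of 2. None of these are the talk's actual heuristics or its real MDL computation.

```python
def mdl_search(corpus, bootstrap, incremental_heuristics, description_length):
    """Greedy search: keep a modified morphology only if it is shorter."""
    morphology = bootstrap(corpus)
    best = description_length(morphology, corpus)
    while True:
        accepted = False
        for heuristic in incremental_heuristics:
            for candidate in heuristic(morphology, corpus):
                length = description_length(candidate, corpus)
                if length < best:  # "Ask MDL!"
                    morphology, best = candidate, length  # old one is discarded
                    accepted = True
                    break
            if accepted:
                break
        if not accepted:  # no improvements left to try
            return morphology, best


# --- Toy stand-ins (hypothetical): a morphology is a frozenset of suffixes ---
CANDIDATE_SUFFIXES = ("s", "ing", "ed")


def bootstrap(corpus):
    return frozenset()  # a deliberately poor preliminary morphology


def add_one_suffix(morphology, corpus):
    for suffix in CANDIDATE_SUFFIXES:
        if suffix not in morphology:
            yield morphology | {suffix}


def naive_length(morphology, corpus, min_stem=2):
    """Letters in the distinct stems plus letters in the suffix list."""
    stems = set()
    for word in corpus:
        stem = word
        for suffix in sorted(morphology, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                stem = word[: -len(suffix)]
                break
        stems.add(stem)
    return sum(map(len, stems)) + sum(map(len, morphology))


corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
final, length = mdl_search(corpus, bootstrap, [add_one_suffix], naive_length)
```

On this toy corpus the search accepts `s`, then `ing`, then `ed`, and stops at a naive length of 29 letters when no further candidate shortens the description.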

1. Bootstrap heuristic

A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.

In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so search space has at least $\prod_{i=1}^{V} |w_i|$ members.
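To get a feel for how quickly this lower bound grows, here is a small sketch that multiplies the word lengths of a tiny, made-up word list. Working in log base 2 keeps the number manageable for realistic vocabularies; the function names are illustrative only.

```python
import math


def search_space_lower_bound(words):
    """Product of |w| over the distinct words: one analysis per cut position."""
    return math.prod(len(w) for w in set(words))


def log2_search_space(words):
    """The same bound in bits, which stays small enough to print."""
    return sum(math.log2(len(w)) for w in set(words))


words = ["jump", "jumps", "jumping", "the", "dog", "dogs"]
lower_bound = search_space_lower_bound(words)
bits = log2_search_space(words)
```

Six short words already allow 5,040 combinations of analyses, which is why a heuristic starting point is needed rather than exhaustive search.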

Better bootstrap heuristics

Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955):

Current variant:

Cut words at certain peaks of successor frequency.

Problems: can over-cut; can under-cut.

Successor frequency

g o v e r n

Empirically, only one letter follows “gover”: “n”

Successor frequency

g o v e r n {m, i, o, s, e, #}

Empirically, 6 letters follow “govern”: “m”, “i”, “o”, “s”, “e”, “#”

Successor frequency

g o v e r n m e

Empirically, 1 letter follows “governm”: “e”

g o v e r [1] n [6] m [1] e

peak of successor frequency: 6, after “govern”
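Successor frequency is straightforward to compute from a word list. The sketch below uses an invented six-word list chosen so that the counts match the slide's example; `#` marks the end of a word, as on the slide. The function names are assumptions made here.

```python
from collections import defaultdict

END = "#"  # end-of-word marker, as on the slide


def successor_table(words):
    """Map each prefix to the set of symbols that follow it in the word list."""
    followers = defaultdict(set)
    for word in set(words):
        marked = word + END
        for i in range(1, len(marked)):
            followers[marked[:i]].add(marked[i])
    return followers


def successor_frequencies(word, followers):
    """Successor frequency after each prefix word[:1], word[:2], ..., word."""
    return [len(followers[word[:i]]) for i in range(1, len(word) + 1)]


# Toy word list (hypothetical), not the corpus used in the talk.
lexicon = ["govern", "governs", "governed", "governing",
           "governor", "government"]
table = successor_table(lexicon)
freqs = successor_frequencies("government", table)
```

Here `freqs` is 1 after every prefix except "govern", where it jumps to 6, which is the peak the heuristic cuts at.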

Lots of errors…

c o n s e r v a t i v e s

9 18 11 6 4 1 2 1 1 2 1 1

wrong right wrong

Even so…

We set conditions:

Accept cuts with stems at least 5 letters in length;

Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment)

Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems.
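These three conditions can be sketched as two small filters. This is one possible reading, with assumptions labeled: `freqs[k - 1]` is taken to be the successor frequency after the first `k` letters, a "clear peak" is read as a value above 1 with a 1 on either side, and the function names and the `NULL` convention for the empty suffix are made up for the example.

```python
from collections import defaultdict


def peak_cuts(freqs, min_stem=5):
    """Stem lengths k at which freqs shows a clear 1 ... N ... 1 peak.

    freqs[k - 1] is the successor frequency after the first k letters.
    """
    cuts = []
    for k in range(max(min_stem, 2), len(freqs)):
        before, here, after = freqs[k - 2], freqs[k - 1], freqs[k]
        if before == 1 and here > 1 and after == 1:
            cuts.append(k)
    return cuts


def frequent_signatures(stem_suffix_pairs, min_stems=5):
    """Collect each stem's suffixes into a signature; keep well-attested ones."""
    suffixes_of = defaultdict(set)
    for stem, suffix in stem_suffix_pairs:
        suffixes_of[stem].add(suffix or "NULL")
    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        stems_of[".".join(sorted(suffixes))].add(stem)
    return {sig: stems for sig, stems in stems_of.items()
            if len(stems) >= min_stems}


# Successor frequencies for "government" in a toy word list (hypothetical).
cuts = peak_cuts([1, 1, 1, 1, 1, 6, 1, 1, 1, 1])
```

For the example frequencies the only accepted cut is after six letters, giving govern-ment.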

2. Incremental heuristics

Coarse-grained to fine-grained

1. Stems and suffixes to split: Accept any analysis of a word if it consists of a known stem and a known suffix.

2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.

Incremental heuristic

3. Slide stem-suffix boundary to the left: Again, use MDL to decide.
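The first two incremental heuristics can be sketched as follows. The function names and data shapes are assumptions made for illustration, the empty suffix is written `NULL`, and the MDL test that decides whether a loose-fit candidate is actually adopted is left out.

```python
from collections import defaultdict


def known_splits(word, stems, suffixes):
    """Heuristic 1: analyses of `word` as a known stem plus a known suffix."""
    return [(word[:i], word[i:] or "NULL")
            for i in range(1, len(word) + 1)
            if word[:i] in stems and (word[i:] or "NULL") in suffixes]


def loose_fit_candidates(words, suffixes):
    """Heuristic 2: strings that precede a known suffix, with apparent suffixes."""
    vocabulary = set(words)
    apparent = defaultdict(set)
    for word in vocabulary:
        for suffix in suffixes:
            if suffix == "NULL":
                continue
            if word.endswith(suffix) and len(word) > len(suffix):
                apparent[word[: -len(suffix)]].add(suffix)
    for stem, found in apparent.items():
        if stem in vocabulary:  # the bare stem also occurs as a word
            found.add("NULL")
    return dict(apparent)


suffixes = {"NULL", "ed", "ion", "ing", "s"}
candidates = loose_fit_candidates(
    ["act", "acted", "action", "acts", "acting"], suffixes)
```

For the five words above, the only candidate stem is "act", with apparent suffixes NULL, ed, ion, ing and s; whether to accept it is the MDL question taken up next.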

How do we use MDL to decide?

Using MDL to judge a potential stem

act, acted, action, acts, acting.

We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s

Let’s compute cost versus savings of signature NULL.ed.ion.ing.s

Savings:

Stem savings: 4 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.

Cost of NULL.ed.ion.ing.s

A pointer to each suffix:

$\log \frac{[W]}{[\text{NULL}]}$, $\log \frac{[W]}{[\text{ed}]}$, $\log \frac{[W]}{[\text{ion}]}$, $\log \frac{[W]}{[\text{ing}]}$, $\log \frac{[W]}{[\text{s}]}$

To give a feel for this: $\log \frac{[W]}{[\text{ed}]} \approx 5$

Total cost of suffix list: about 30 bits.

Cost of pointer to signature: total cost is $\log \frac{[W]}{[\#\,\text{stems that use this sig}]} \approx 13$ bits

Cost of signature: about 45 bits

Savings: about 60 bits

so MDL says: Do it! Analyze the words as stem + suffix.

Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
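The cost-versus-savings comparison can be reproduced roughly in code. Several things here are assumptions rather than figures from the talk: letters are costed with a uniform code over 26 symbols (which gives about 56 bits for 12 letters, in line with "almost 60"), [W] and the bracketed counts are read as occurrence counts, and the specific counts are hypothetical values picked only so that the pointer costs come out at the 5 and 13 bits quoted above. The 30-bit suffix-list cost is taken directly from the slide.

```python
import math

BITS_PER_LETTER = math.log2(26)  # assumption: uniform code over 26 letters


def pointer_cost(total_words, count):
    """Bits for a pointer to an item used `count` times among `total_words`."""
    return math.log2(total_words / count)


def stem_savings(stem, copies_saved):
    """Bits saved by no longer spelling out `copies_saved` copies of `stem`."""
    return len(stem) * copies_saved * BITS_PER_LETTER


# Hypothetical counts, chosen only to reproduce the slide's pointer costs.
W = 2 ** 20
ed_pointer = pointer_cost(W, 2 ** 15)          # 5 bits
signature_pointer = pointer_cost(W, 2 ** 7)    # 13 bits

suffix_list_cost = 30                          # "about 30 bits" on the slide
cost = suffix_list_cost + signature_pointer    # roughly the slide's 45 bits
savings = stem_savings("act", 4)               # 12 letters
do_it = savings > cost
```

If one of the five suffixes did not already exist, its letters would have to be added to `cost`, which is the point of the closing remark on the slide.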

Repair heuristics: using MDL

We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths.

[Diagram: Original morphology + compressed data <> Revised morphology + compressed data]
