
HELSINKI UNIVERSITY OF TECHNOLOGY

NEURAL NETWORKS RESEARCH CENTRE

Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

{ Mathias.Creutz, Krista.Lagus }@hut.fi

International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), Espoo, 17 June 2005

kahvi + n + juo + ja + lle + kin

nyky + ratkaisu + i + sta + mme

tietä + isi + mme + kö + hän

open + mind + ed + ness

un + believ + able

Mathias Creutz, 17 June 2005



Challenge for NLP: too many words

• E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):

– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)

– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)

– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)

• Huge number of different possible word forms

• Important to know the inner structure of words

• The number of morphemes per word varies greatly




Goal

• Learn representations of

– the smallest individually meaningful units of language (morphemes)

– and their interaction

– in an unsupervised and data-driven manner from raw text

– making as general and language-independent assumptions as possible.

Morfessor




State of the art

• Rule-based systems

– accurate, language-dependent, adaptivity issues

• Unsupervised word segmentation

– sentences can be of different length

– context-insensitive, so poor modeling of syntax:

• undersegmentation of frequent strings (“forthepurposeof”)

• oversegmentation of rare strings (“in + s + an + e”)

• no syntactic / morphotactic constraints (“s + can”)

Morfessor Baseline




State of the art (cont’d)

• Morphology learning

– Beyond segmentation: allomorphy (“foot – feet, goose – geese”)

– Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)

– Learning of paradigms (e.g., John Goldsmith’s Linguistica)

[Figure: example paradigm in which the stems believ, hop, liv, mov, us combine with the suffixes e, ed, es, ing]

Very restricted syntax / morphotactics in terms of number of morphemes per word form!




Morfessor with morpheme categories

• Lexicon / Grammar dualism

– Word structure captured by a regular expression: word = ( prefix* stem suffix* )+

– Morph sequences (words) are generated by a Hidden Markov model:

[Figure: HMM generating the morph sequence “# over simpl ific ation s #”. Transition probs, e.g., P(STM | PRE), P(SUF | SUF). Emission probs, e.g., P(’over’ | PRE), P(’s’ | SUF).]
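The two ideas on this slide can be illustrated with a small sketch: a check of a category sequence against word = ( prefix* stem suffix* )+, and the log-probability of a tagged morph sequence under a first-order HMM. The one-letter tag encoding, the function names and every probability value below are invented for illustration; they are not taken from Morfessor.

```python
import math
import re

# Hypothetical one-letter encoding: P = prefix, S = stem, F = suffix.
WORD_PATTERN = re.compile(r"(?:P*SF*)+")


def is_valid_word(tags):
    """Return True if the tag sequence matches ( prefix* stem suffix* )+."""
    return WORD_PATTERN.fullmatch("".join(tags)) is not None


def hmm_log_prob(morphs, tags, transition, emission):
    """Log-probability of a tagged morph sequence under a first-order HMM.

    '#' stands for the word boundary before the first and after the last morph.
    """
    log_p = 0.0
    previous = "#"
    for morph, tag in zip(morphs, tags):
        log_p += math.log(transition[(previous, tag)])  # category transition
        log_p += math.log(emission[(tag, morph)])       # morph emission
        previous = tag
    log_p += math.log(transition[(previous, "#")])      # back to the boundary
    return log_p


# Toy parameters, invented for this example only.
transition = {("#", "P"): 0.2, ("P", "S"): 0.9, ("S", "F"): 0.6,
              ("F", "F"): 0.3, ("F", "#"): 0.7}
emission = {("P", "over"): 0.05, ("S", "simpl"): 0.001,
            ("F", "ific"): 0.01, ("F", "ation"): 0.02, ("F", "s"): 0.2}

morphs = ["over", "simpl", "ific", "ation", "s"]
tags = ["P", "S", "F", "F", "F"]
print(is_valid_word(tags))
print(hmm_log_prob(morphs, tags, transition, emission))
```

A sequence such as suffix + stem is rejected by the pattern, which is the kind of morphotactic constraint that a context-insensitive segmentation model lacks.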




Lexicon

Morphs are stored with their “Meaning” (frequency, right perplexity, left perplexity) and their “Form” (length, string):

Frequency   Right perplexity   Left perplexity   Length   String
14029       136                1                 4        over
41          4                  1                 5        simpl
17259       1                  4618              1        s
...




How meaning affects morphotactic role

• Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)

• Assume asymmetries between the categories:

[Figure, three panels: suffix-likeness as a function of left perplexity; prefix-likeness as a function of right perplexity; stem-likeness as a function of morph length]




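How meaning affects role (cont’d)

• There is an additional non-morpheme category for cases where none of the proper classes is likely:

$$P(\mathrm{NON} \mid \text{'over'}) = [1 - \mathrm{Prefixlike}(\text{'over'})] \cdot [1 - \mathrm{Stemlike}(\text{'over'})] \cdot [1 - \mathrm{Suffixlike}(\text{'over'})]$$

• Distribute remaining probability mass proportionally, e.g.,

$$P(\mathrm{PRE} \mid \text{'over'}) = \frac{\mathrm{Prefixlike}(\text{'over'})^{q} \cdot [1 - P(\mathrm{NON} \mid \text{'over'})]}{\mathrm{Prefixlike}(\text{'over'})^{q} + \mathrm{Stemlike}(\text{'over'})^{q} + \mathrm{Suffixlike}(\text{'over'})^{q}}$$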
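The two formulas on this slide can be written out as a short function. This is a minimal sketch that takes the three likeness scores as given numbers in [0, 1]; how they are computed from the perplexities and the length is not covered here. The function name and the default value of the exponent q are assumptions, since the slides do not state a value for q.

```python
def category_priors(prefix_like, stem_like, suffix_like, q=2.0):
    """Turn likeness scores in [0, 1] into P(PRE), P(STM), P(SUF), P(NON).

    q is the exponent from the slide; its default here is an assumption.
    """
    # Non-morpheme: none of the proper categories is likely.
    p_non = (1.0 - prefix_like) * (1.0 - stem_like) * (1.0 - suffix_like)
    powered = {
        "PRE": prefix_like ** q,
        "STM": stem_like ** q,
        "SUF": suffix_like ** q,
    }
    total = sum(powered.values())
    priors = {"NON": p_non}
    for category, value in powered.items():
        # Each proper category gets its proportional share of the remaining mass.
        priors[category] = (1.0 - p_non) * value / total if total > 0 else 0.0
    return priors


# Hypothetical likeness scores for a prefix-like morph.
priors = category_priors(prefix_like=0.9, stem_like=0.2, suffix_like=0.1)
print(priors)
print(sum(priors.values()))
```

By construction the four priors sum to one whenever at least one likeness score is positive.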




Maximum a posteriori optimization

Morfessor Categories-MAP:

$$\arg\max_{\mathrm{Lexicon}} P(\mathrm{Lexicon} \mid \mathrm{Corpus}) = \arg\max_{\mathrm{Lexicon}} P(\mathrm{Corpus} \mid \mathrm{Lexicon}) \cdot P(\mathrm{Lexicon})$$

Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)

[Figure: the lexicon entries (over, simpl, s, ...) and the HMM over “# over simpl ific ation s #” from the earlier slides]

Balance accuracy of representation of data against size of lexicon
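In log space the MAP criterion becomes a sum of two costs, which makes the trade-off easy to see. The sketch below is only an illustration: the function names are hypothetical and the probabilities of the two candidate lexicons are invented numbers, not results from Morfessor.

```python
import math


def map_cost(log_p_corpus_given_lexicon, log_p_lexicon):
    """Negative log posterior up to a constant; lower is better."""
    return -(log_p_corpus_given_lexicon + log_p_lexicon)


def best_lexicon(candidates):
    """Return the name of the candidate with the lowest MAP cost.

    candidates maps a name to (log P(Corpus | Lexicon), log P(Lexicon)).
    """
    return min(candidates, key=lambda name: map_cost(*candidates[name]))


# Invented numbers: a large lexicon of whole words fits the corpus well but is
# improbable a priori; a small lexicon of morphs fits slightly worse but is
# much more probable.
candidates = {
    "whole words": (math.log(1e-20), math.log(1e-60)),
    "morphs": (math.log(1e-25), math.log(1e-30)),
}
print(best_lexicon(candidates))  # morphs
```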




Over- and undersegmentation still a problem?

• Probability of adding an entry to the lexicon:

$$P(\text{'morgana'}) = P(\mathrm{Freq} = 1) \cdot P(\mathrm{RightPpl} = 1) \cdot P(\mathrm{LeftPpl} = 1) \cdot P(\mathrm{Length} = 7) \cdot P(\text{'m'}) \cdot P(\text{'o'}) \cdot P(\text{'r'}) \cdot P(\text{'g'}) \cdot P(\text{'a'}) \cdot P(\text{'n'}) \cdot P(\text{'a'})$$

Rare strings are split into smaller parts (e.g., morgan + a)

• Probability of sequences in the corpus:

# hands # vs. # hand s #

Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)
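The entry probability for ’morgana’ is a product of four feature probabilities and one probability per letter. A minimal sketch of that product in log space follows; the function name is hypothetical, and the feature probabilities and the uniform letter distribution are placeholders, because the slides do not specify the underlying distributions.

```python
import math


def lexicon_entry_log_prob(string, p_freq, p_right_ppl, p_left_ppl, p_length,
                           letter_probs):
    """log P(entry): four feature probabilities times one factor per letter."""
    log_p = (math.log(p_freq) + math.log(p_right_ppl)
             + math.log(p_left_ppl) + math.log(p_length))
    for letter in string:
        log_p += math.log(letter_probs[letter])
    return log_p


# Placeholder values, for illustration only.
letters = "abcdefghijklmnopqrstuvwxyz"
uniform = {letter: 1.0 / len(letters) for letter in letters}

whole = lexicon_entry_log_prob("morgana", 0.1, 0.1, 0.1, 0.1, uniform)
short = lexicon_entry_log_prob("a", 0.1, 0.1, 0.1, 0.1, uniform)
print(whole, short)
```

Every extra letter multiplies in another factor below one, so a long string that occurs only once is an expensive lexicon entry.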




Solution: Hierarchical structures in lexicon

[Figure: hierarchical segmentation of “oppositio + kansanedustaja”: oppositio = op + positio; kansanedustaja = kansan + edustaja; kansan = kansa + n; edustaja = edusta + ja. Nodes are marked as non-morpheme, stem or suffix.]

• Make morphs consist of submorphs.

• Expand the tree when performing morpheme segmentation.

• Do not expand morphs consisting of non-morphemes.
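The expansion rule in the bullets above can be sketched as a recursive walk over the tree. The `Morph` class, the `expand` function and the category labels given to the example nodes are assumptions made for illustration. In particular, the sketch stops at a node as soon as any of its submorphs is a non-morpheme, which is one possible reading of the last bullet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON" (labels assumed here)
    children: List["Morph"] = field(default_factory=list)


def expand(morph):
    """Return the finest segmentation that contains no non-morphemes."""
    if not morph.children or any(c.category == "NON" for c in morph.children):
        return [morph.string]  # leaf, or expansion would expose a non-morpheme
    segments = []
    for child in morph.children:
        segments.extend(expand(child))
    return segments


# Example trees; the category of each node is an assumption.
oppositio = Morph("oppositio", "STM",
                  [Morph("op", "NON"), Morph("positio", "NON")])
kansanedustaja = Morph("kansanedustaja", "STM", [
    Morph("kansan", "STM", [Morph("kansa", "STM"), Morph("n", "SUF")]),
    Morph("edustaja", "STM", [Morph("edusta", "STM"), Morph("ja", "SUF")]),
])

print(expand(oppositio))       # ['oppositio']
print(expand(kansanedustaja))  # ['kansa', 'n', 'edusta', 'ja']
```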




Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)

• Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation = Hutmegs

• Covers

– 1.4 million Finnish word forms

– 120 000 English word forms

• Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.




Evaluation against the Hutmegs Gold Standard

[Figure, two panels (Finnish, English): F-measure [%] as a function of corpus size [1000 words], with corpus sizes from 10 up to 12000 in one panel and 16000 in the other. Methods compared: Ctxt-insens. (Baseline), Paradigms (Linguistica), Heuristic (Categories-ML), Categories-MAP.]
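An F-measure for segmentation can be computed in several ways. The sketch below assumes that it is taken over the positions of morph boundaries inside each word, comparing a proposed segmentation with the gold standard; the slides do not spell out the exact definition, so this is one plausible reading, with hypothetical function names and example words.

```python
def boundaries(segmentation):
    """Character offsets of the internal morph boundaries of one word."""
    positions, offset = set(), 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def f_measure(proposed, gold):
    """Boundary F-measure over two parallel lists of segmented words."""
    hits = n_proposed = n_gold = 0
    for prop, ref in zip(proposed, gold):
        p, g = boundaries(prop), boundaries(ref)
        hits += len(p & g)       # boundaries found in both segmentations
        n_proposed += len(p)
        n_gold += len(g)
    if hits == 0:
        return 0.0
    precision = hits / n_proposed
    recall = hits / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one proposed boundary too many in the second word.
proposed = [["hand", "s"], ["un", "expect", "ed", "ly"]]
gold = [["hand", "s"], ["un", "expect", "edly"]]
print(f_measure(proposed, gold))
```

Here three of the four proposed boundaries are correct and all three gold boundaries are found, so precision is 0.75 and recall is 1.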




Example segmentations

Finnish                               English
[ aarre kammio ] issa                 [ accomplish es ]
[ aarre kammio ] on                   [ accomplish ment ]
bahama laiset                         [ beautiful ly ]
bahama [ saari en ]                   [ insur ed ]
[ epä [ [ tasa paino ] inen ] ]       [ insure s ]
maclare n                             [ insur ing ]
[ nais [ autoili ja ] ] a             [ [ [ photo graph ] er ] s ]
[ sano ttiin ] ko                     [ present ly ] found
töhri ( mis istä )                    [ re siding ]
[ [ voi mme ] ko ]                    [ [ un [ expect ed ] ] ly ]




Discussion

• Possibility to extend the model

– rudimentary features used for “meaning”

– more fine-grained categories

– beyond concatenative phenomena (e.g., goose – geese)

– allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)

• Already now useful in applications

– automatic speech recognition (Finnish, Turkish)




Morpho project page: http://www.cis.hut.fi/projects/morpho/




Demo 6

http://www.cis.hut.fi/projects/morpho/




Demo 7
