inducing the morphological lexicon of a natural language from unannotated text
HELSINKI UNIVERSITY OF TECHNOLOGY
NEURAL NETWORKS RESEARCH CENTRE
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
{ Mathias.Creutz, Krista.Lagus }@hut.fi
International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)
Espoo, 17 June 2005
kahvi + n + juo + ja + lle + kin
nyky + ratkaisu + i + sta + mme
tietä + isi + mme + kö + hän
open + mind + ed + ness
un + believ + able
17 June 2005, Mathias Creutz
Challenge for NLP: too many words
• E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
• Huge number of different possible word forms
• Important to know the inner structure of words
• The number of morphemes per word varies greatly
Goal
• Learn representations of
– the smallest individually meaningful units of language (morphemes)
– and their interaction
– in an unsupervised and data-driven manner from raw text
– making as general and language-independent assumptions as possible.
Morfessor
State of the art
• Rule-based systems
– accurate, language-dependent, adaptivity issues
• Unsupervised word segmentation (e.g., Morfessor Baseline)
– sentences can be of different length
– context-insensitive, hence poor modeling of syntax:
  • undersegmentation of frequent strings (“forthepurposeof”)
  • oversegmentation of rare strings (“in + s + an + e”)
  • no syntactic / morphotactic constraints (“s + can”)
State of the art (cont’d)
• Morphology learning
– Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
– Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
– Learning of paradigms (e.g., John Goldsmith’s Linguistica)
[Figure: example paradigm pairing the stems believ, hop, liv, mov, us with the suffixes e, ed, es, ing]
• Very restricted syntax / morphotactics in terms of number of morphemes per word form!
Morfessor with morpheme categories
• Lexicon / Grammar dualism
– Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
– Morph sequences (words) are generated by a Hidden Markov model:
[Figure: HMM generating the morph sequence # over simpl ific ation s #]
Transition probs, e.g., P(STM | PRE), P(SUF | SUF)
Emission probs, e.g., P(’over’ | PRE), P(’s’ | SUF)
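The word-structure expression and the HMM factorization can be sketched in a few lines of Python. This is only an illustration: the single-letter category encoding, the function names and every probability value below are assumptions made for the example, not values from the talk.

```python
import math
import re

# Word structure from the slide: word = ( prefix* stem suffix* )+
# Assumed encoding: P = prefix, S = stem, X = suffix.
WORD_PATTERN = re.compile(r"(?:P*SX*)+")


def is_valid_category_sequence(categories):
    """Return True if the category string matches ( prefix* stem suffix* )+."""
    return WORD_PATTERN.fullmatch(categories) is not None


# Hypothetical transition and emission probabilities (illustrative only).
# "#" stands for the word boundary.
TRANSITIONS = {
    ("#", "PRE"): 0.2, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.5,
    ("SUF", "SUF"): 0.4, ("SUF", "#"): 0.6,
}
EMISSIONS = {
    ("over", "PRE"): 0.05, ("simpl", "STM"): 0.001, ("ific", "SUF"): 0.01,
    ("ation", "SUF"): 0.02, ("s", "SUF"): 0.2,
}


def word_log_prob(tagged_morphs):
    """Sum log transition and log emission probabilities along the word."""
    previous = "#"
    total = 0.0
    for morph, category in tagged_morphs:
        total += math.log(TRANSITIONS[(previous, category)])
        total += math.log(EMISSIONS[(morph, category)])
        previous = category
    # Close the word with a transition back to the boundary symbol.
    return total + math.log(TRANSITIONS[(previous, "#")])


oversimplifications = [("over", "PRE"), ("simpl", "STM"), ("ific", "SUF"),
                       ("ation", "SUF"), ("s", "SUF")]
print(is_valid_category_sequence("PSXXX"))   # prefix stem suffix suffix suffix
print(word_log_prob(oversimplifications))
```

A sequence that starts with a suffix, such as "XS", is rejected by the pattern, which is the kind of morphotactic constraint the context-insensitive models lack.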
Lexicon (one morph per row)
“Meaning”                                                   “Form”
Frequency   Right perplexity   Left perplexity   Length   String
14029       136                1                 4        over
41          4                  1                 5        simpl
17259       1                  4618              1        s
...
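The slide does not say how the perplexity features are computed. A minimal sketch, assuming that the right perplexity of a morph is the perplexity (exponential of the entropy) of the distribution of items that immediately follow it in a segmented corpus, with "#" marking the end of a word; the function names and the toy corpus are hypothetical:

```python
import math
from collections import Counter


def perplexity(context_counts):
    """Perplexity of the empirical distribution given by a Counter of contexts."""
    total = sum(context_counts.values())
    entropy = -sum((n / total) * math.log(n / total)
                   for n in context_counts.values())
    return math.exp(entropy)


def right_perplexity(morph, segmented_words):
    """Perplexity of what follows `morph`; '#' is the assumed end-of-word symbol."""
    followers = Counter()
    for morphs in segmented_words:
        padded = list(morphs) + ["#"]
        for current, following in zip(padded, padded[1:]):
            if current == morph:
                followers[following] += 1
    return perplexity(followers)


# Toy corpus: "over" is followed by four different morphs, "s" only by "#".
toy_corpus = [["over", "simpl", "ific", "ation", "s"], ["over", "look", "s"],
              ["over", "come"], ["over", "load"]]
print(right_perplexity("over", toy_corpus))
print(right_perplexity("s", toy_corpus))
```

Left perplexity would be computed the same way over the preceding items. A prefix-like morph such as "over" is followed by many different morphs, which is consistent with its high right perplexity in the table.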
How meaning affects morphotactic role
• Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)
• Assume asymmetries between the categories:
[Figure: three plots, each with a vertical axis from 0 to 1.2: prefix-likeness as a function of right perplexity (10 to 90), suffix-likeness as a function of left perplexity (10 to 90), and stem-likeness as a function of morph length]
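How meaning affects role (cont’d)
• There is an additional non-morpheme category for cases where none of the proper classes is likely:
$$P(\mathrm{NON} \mid \text{'over'}) = [1 - \mathrm{Prefixlike}(\text{'over'})] \cdot [1 - \mathrm{Stemlike}(\text{'over'})] \cdot [1 - \mathrm{Suffixlike}(\text{'over'})]$$
• Distribute remaining probability mass proportionally, e.g.,
$$P(\mathrm{PRE} \mid \text{'over'}) = \frac{\mathrm{Prefixlike}(\text{'over'})^{q} \cdot [1 - P(\mathrm{NON} \mid \text{'over'})]}{\mathrm{Prefixlike}(\text{'over'})^{q} + \mathrm{Stemlike}(\text{'over'})^{q} + \mathrm{Suffixlike}(\text{'over'})^{q}}$$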
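The two formulas translate directly into code. In this sketch the stem and suffix priors are written by analogy with the prefix formula (the slide gives only the prefix case as an example), and both the value q = 2 and the likeness values passed in are assumptions chosen for illustration.

```python
def category_priors(prefixlike, stemlike, suffixlike, q=2.0):
    """Prior category probabilities of one morph from its likeness values.

    q is the exponent from the slide; its default here is an assumption.
    """
    # Non-morpheme: none of the three proper categories is likely.
    non = (1 - prefixlike) * (1 - stemlike) * (1 - suffixlike)
    likeness = {"PRE": prefixlike, "STM": stemlike, "SUF": suffixlike}
    norm = sum(value ** q for value in likeness.values())
    priors = {"NON": non}
    for category, value in likeness.items():
        # Share the remaining mass (1 - non) in proportion to likeness ** q.
        priors[category] = (value ** q) * (1 - non) / norm if norm > 0 else 0.0
    return priors


# Hypothetical likeness values for a prefix-like morph such as 'over'.
priors = category_priors(prefixlike=0.9, stemlike=0.2, suffixlike=0.1)
print(priors)
print(sum(priors.values()))
```

By construction the four priors sum to one, since the three proper categories share exactly the mass left over by the non-morpheme category.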
Maximum a posteriori optimization
$$\arg\max_{\mathrm{Lexicon}} P(\mathrm{Lexicon} \mid \mathrm{Corpus}) = \arg\max_{\mathrm{Lexicon}} P(\mathrm{Corpus} \mid \mathrm{Lexicon}) \cdot P(\mathrm{Lexicon})$$
• Morfessor Categories-MAP
• Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)
[Figure: the lexicon table (over, simpl, s) and the HMM example (# over simpl ific ation s #) repeated from the earlier slides]
• Balance accuracy of representation of data against size of lexicon
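The step behind the MAP equation is Bayes' rule followed by dropping the factor that does not depend on the lexicon. Written out, and restated as a cost to minimize (the negative-log form is a standard equivalent, not something shown on the slide):

$$\arg\max_{\mathrm{Lexicon}} P(\mathrm{Lexicon} \mid \mathrm{Corpus}) = \arg\max_{\mathrm{Lexicon}} \frac{P(\mathrm{Corpus} \mid \mathrm{Lexicon}) \cdot P(\mathrm{Lexicon})}{P(\mathrm{Corpus})} = \arg\max_{\mathrm{Lexicon}} P(\mathrm{Corpus} \mid \mathrm{Lexicon}) \cdot P(\mathrm{Lexicon})$$

$$= \arg\min_{\mathrm{Lexicon}} \left[ -\log P(\mathrm{Corpus} \mid \mathrm{Lexicon}) - \log P(\mathrm{Lexicon}) \right]$$

The first term of the cost measures how accurately the lexicon represents the data and the second grows with the size of the lexicon, which is the balance the slide refers to.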
Over- and undersegmentation still a problem?
• Probability of adding an entry to the lexicon:
$$P(\text{'morgana'}) = P(\mathrm{Freq}=1) \cdot P(\mathrm{RightPpl}=1) \cdot P(\mathrm{LeftPpl}=1) \cdot P(\mathrm{Length}=7) \cdot P(\text{'m'}) \cdot P(\text{'o'}) \cdot P(\text{'r'}) \cdot P(\text{'g'}) \cdot P(\text{'a'}) \cdot P(\text{'n'}) \cdot P(\text{'a'})$$
Rare strings are split into smaller parts (e.g., morgan + a)
• Probability of sequences in the corpus:
# hands # vs. # hand s #
Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)
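The lexicon-entry probability is a plain product of feature and letter probabilities, so it can be sketched as a sum of logs. The uniform letter distribution and the feature probabilities of 0.5 used below are placeholders for illustration only; the talk does not give these distributions.

```python
import math


def entry_log_prob(morph, letter_probs, feature_probs):
    """Log probability of adding `morph` to the lexicon.

    feature_probs holds the four values P(Freq), P(RightPpl), P(LeftPpl)
    and P(Length) for this morph; letter_probs maps each letter to P(letter).
    """
    log_p = sum(math.log(p) for p in feature_probs)
    for letter in morph:
        log_p += math.log(letter_probs[letter])
    return log_p


# Hypothetical distributions: uniform letters, every feature probability 0.5.
uniform_letters = {chr(code): 1 / 26 for code in range(ord("a"), ord("z") + 1)}
features = [0.5, 0.5, 0.5, 0.5]

print(entry_log_prob("morgana", uniform_letters, features))
print(entry_log_prob("a", uniform_letters, features))
```

Every extra letter multiplies in another factor below one, so a long string seen only once is an expensive lexicon entry. That cost is what pushes the model toward reusing shorter existing morphs, as in morgan + a.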
Solution: Hierarchical structures in lexicon
[Figure: tree for oppositio + kansanedustaja, with nodes marked as non-morpheme, stem or suffix]
oppositio = op + positio
kansanedustaja = kansan + edustaja
kansan = kansa + n
edustaja = edusta + ja
• Make morphs consist of submorphs.
• Expand the tree when performing morpheme segmentation.
• Do not expand morphs consisting of non-morphemes.
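The expansion rule can be sketched as a recursive walk over the tree. The `Morph` class, the `expand` function and the category assigned to each node below are assumptions for illustration (the node markings in the figure did not survive extraction), and the rule is read here as: stop expanding a morph as soon as any of its submorphs is a non-morpheme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON" (non-morpheme)
    children: List["Morph"] = field(default_factory=list)


def expand(morph):
    """Return the finest segmentation that contains no non-morphemes."""
    has_non_morpheme = any(child.category == "NON" for child in morph.children)
    if not morph.children or has_non_morpheme:
        return [morph.string]
    segments = []
    for child in morph.children:
        segments.extend(expand(child))
    return segments


# Hypothetical category assignments for the example on the slide.
oppositio = Morph("oppositio", "STM",
                  [Morph("op", "NON"), Morph("positio", "NON")])
kansan = Morph("kansan", "STM", [Morph("kansa", "STM"), Morph("n", "SUF")])
edustaja = Morph("edustaja", "STM",
                 [Morph("edusta", "STM"), Morph("ja", "SUF")])
kansanedustaja = Morph("kansanedustaja", "STM", [kansan, edustaja])
word = Morph("oppositiokansanedustaja", "STM", [oppositio, kansanedustaja])

print(" + ".join(expand(word)))
```

Under these assumed categories, oppositio stays whole because its parts are non-morphemes, while kansanedustaja is expanded down to its leaves.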
Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)
• Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation = Hutmegs
• Covers
– 1.4 million Finnish word forms
– 120 000 English word forms
• Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.
Evaluation against the Hutmegs Gold Standard
[Figure: two plots, Finnish and English, of F-measure [%] against corpus size [1000 words]; the horizontal axis runs from 10 to 12000 in one panel and from 10 to 16000 in the other. Compared methods: Ctxt-insens. (Baseline), Paradigms (Linguistica), Heuristic (Categories-ML), Categories-MAP]
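The slides report F-measure without spelling out how it is computed. A minimal sketch, assuming that it is the harmonic mean of precision and recall over morph boundary positions, comparing a proposed segmentation with a gold-standard one; the function names and the two-word example are hypothetical:

```python
def boundaries(morphs):
    """Character offsets of the internal morph boundaries of one word."""
    positions = set()
    offset = 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_f_measure(proposed, gold):
    """F-measure of proposed boundaries against gold boundaries.

    Both arguments are lists of segmentations of the same words, in order.
    """
    hits = n_proposed = n_gold = 0
    for proposed_morphs, gold_morphs in zip(proposed, gold):
        proposed_set = boundaries(proposed_morphs)
        gold_set = boundaries(gold_morphs)
        hits += len(proposed_set & gold_set)
        n_proposed += len(proposed_set)
        n_gold += len(gold_set)
    precision = hits / n_proposed if n_proposed else 0.0
    recall = hits / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


proposed = [["un", "believ", "able"], ["hands"]]
gold = [["un", "believ", "able"], ["hand", "s"]]
print(boundary_f_measure(proposed, gold))
```

In this toy case every proposed boundary is correct, but the undersegmented "hands" misses one gold boundary, so recall and hence F-measure fall below 1.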
Example segmentations
Finnish                              English
[ aarre kammio ] issa                [ accomplish es ]
[ aarre kammio ] on                  [ accomplish ment ]
bahama laiset                        [ beautiful ly ]
bahama [ saari en ]                  [ insur ed ]
[ epä [ [ tasa paino ] inen ] ]      [ insure s ]
maclare n                            [ insur ing ]
[ nais [ autoili ja ] ] a            [ [ [ photo graph ] er ] s ]
[ sano ttiin ] ko                    [ present ly ] found
töhri ( mis istä )                   [ re siding ]
[ [ voi mme ] ko ]                   [ [ un [ expect ed ] ] ly ]
Discussion
• Possibility to extend the model
– rudimentary features used for “meaning”
– more fine-grained categories
– beyond concatenative phenomena (e.g., goose – geese)
– allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
• Already useful in applications
– automatic speech recognition (Finnish, Turkish)
Morpho project page: http://www.cis.hut.fi/projects/morpho/
Demo 6
http://www.cis.hut.fi/projects/morpho/
Demo 7