![Page 1: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/1.jpg)
CS 4705
Morphology: Words and their Parts
Julia Hirschberg
![Page 2: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/2.jpg)
Words
• In formal languages, words are arbitrary strings
• In natural languages, words are made up of meaningful subunits called morphemes
  – Morphemes are abstract concepts denoting entities or relationships
  – Morphemes may be
    • Stems: the main morpheme of the word
    • Affixes: convey the word’s role, number, gender, etc.
    • cats == cat [stem] + s [suffix]
    • undo == un [prefix] + do [stem]
![Page 3: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/3.jpg)
Why do we need to do Morphological Analysis?
• The study of how words are composed from smaller, meaning-bearing units (morphemes)
• Applications:
  – Spelling correction: referece
  – Hyphenation algorithms: refer-ence
  – Part-of-speech analysis: googler [N], googling [V]
  – Text-to-speech: grapheme-to-phoneme conversion
    • hothouse (/T/ or /D/)
![Page 4: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/4.jpg)
  – Lets us guess the meaning of unknown words
    • ‘Twas brillig and the slithy toves…
    • Muggles moogled migwiches
![Page 5: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/5.jpg)
Morphotactics
• What are the ‘rules’ for constructing a word in a given language?
  – Pseudo-intellectual vs. *intellectual-pseudo
  – Rational-ize vs. *ize-rational
  – Cretin-ous vs. *cretin-ly vs. *cretin-acious
• Possible ‘rules’
  – Suffixes are suffixes and prefixes are prefixes
  – Certain affixes attach to certain types of stems (nouns, verbs, etc.)
  – Certain stems can/cannot take certain affixes
![Page 6: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/6.jpg)
• Semantics: In English, un- cannot attach to adjectives that already have a negative connotation:
  – Unhappy vs. *unsad
  – Unhealthy vs. *unsick
  – Unclean vs. *undirty
• Phonology: In English, -er cannot attach to words of more than two syllables
  – Great, greater
  – Happy, happier
  – Competent, *competenter
  – Elegant, *eleganter
  – Unruly, ?unrulier
![Page 7: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/7.jpg)
Regular and Irregular Morphology
• Regular
  – Walk, walks, walking, walked, (had) walked
  – Table, tables
• Irregular
  – Eat, eats, eating, ate, (had) eaten
  – Catch, catches, catching, caught, (had) caught
  – Cut, cuts, cutting, cut, (had) cut
  – Goose, geese
![Page 8: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/8.jpg)
Morphological Parsing
• Algorithms developed to use regularities -- and known irregularities -- to parse words into their morphemes
• Cats → cat +N +PL
• Cat → cat +N +SG
• Cities → city +N +PL
• Merging → merge +V +Present-participle
• Caught → catch +V +past-participle
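The mappings above can be sketched as a tiny parser that checks a list of known irregulars first and then tries regular suffix rules. This is only an illustrative sketch: the names `STEMS`, `IRREGULAR`, `SUFFIX_RULES` and `parse` are hypothetical, and the lexicon holds just the words from this slide.

```python
# Toy lexicon: stem -> part of speech (only the words used on the slide).
STEMS = {"cat": "N", "city": "N", "merge": "V", "catch": "V"}

# Known irregular forms are simply listed with their analysis.
IRREGULAR = {"caught": "catch +V +past-participle"}

# (surface suffix, letters to restore on the stem, part of speech, feature tag)
SUFFIX_RULES = [
    ("ies", "y", "N", "+PL"),                  # cities  -> city
    ("s",   "",  "N", "+PL"),                  # cats    -> cat
    ("ing", "e", "V", "+Present-participle"),  # merging -> merge
    ("ing", "",  "V", "+Present-participle"),
    ("",    "",  "N", "+SG"),                  # bare noun stem
]

def parse(word):
    """Return an analysis such as 'cat +N +PL', or None if no rule applies."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, restore, pos, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[:len(word) - len(suffix)] + restore
            # Accept the split only if the stem is in the lexicon with this POS.
            if STEMS.get(stem) == pos:
                return f"{stem} +{pos} {tag}"
    return None

for w in ["Cats", "Cat", "Cities", "Merging", "Caught"]:
    print(w, "->", parse(w))
```

Listing irregulars separately and checking the lexicon after each suffix split is what keeps a rule like "strip -s" from producing analyses for stems the system does not know.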
![Page 9: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/9.jpg)
Morphology and Finite State Automata
• We can use the machinery provided by FSAs to capture facts about morphology
• Accept strings that are in the language
• Reject strings that are not
• Do this in a way that does not require us to list all the words in the language
![Page 10: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/10.jpg)
How do we build a Morphological Analyzer?
• Lexicon: list of stems and affixes (w/ corresponding part of speech (p.o.s.))
• Morphotactics of the language: model of how and which morphemes can be affixed to a stem
• Orthographic rules: spelling modifications that may occur when affixation occurs
  – in → il in context of l (in- + legal)
• Most morphological phenomena can be described with regular expressions – so finite state techniques often used to represent morphological processes
![Page 11: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/11.jpg)
Some Simple Rules
• Regular singular nouns stay as is
• Regular plural nouns have an -s on the end
• Irregulars stay as is
![Page 12: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/12.jpg)
Simple English NP FSA
![Page 13: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/13.jpg)
Expand the Arcs with Stems and Affixes
[Figure: the FSA with its arcs expanded into stems and affixes; arc labels include cat, dog, child, geese]
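A minimal sketch of such a recognizer, assuming a three-state machine for noun inflection (regular stem, optional plural -s, or an irregular form). The state numbering, the class names in `LEXICON`, and the exact word lists are assumptions for illustration, not a transcription of the figure.

```python
# Morpheme classes expanded into concrete strings (illustrative word lists).
LEXICON = {
    "reg-noun":      {"cat", "dog"},
    "irreg-sg-noun": {"child", "goose"},
    "irreg-pl-noun": {"children", "geese"},
    "plural-s":      {"s"},
}

# state -> list of (morpheme class on the arc, next state)
ARCS = {
    0: [("reg-noun", 1), ("irreg-sg-noun", 2), ("irreg-pl-noun", 2)],
    1: [("plural-s", 2)],
}
ACCEPT = {1, 2}

def accepts(word, state=0):
    """True if the FSA can consume the whole word and stop in an accepting state."""
    if not word:
        return state in ACCEPT
    for label, nxt in ARCS.get(state, []):
        for morph in LEXICON[label]:
            # Follow the arc if the remaining input starts with this morpheme.
            if word.startswith(morph) and accepts(word[len(morph):], nxt):
                return True
    return False

for w in ["cat", "cats", "geese", "gooses", "childs"]:
    print(w, accepts(w))
```

Adding a new regular noun means adding one entry to `reg-noun`; its plural is then accepted without being listed, which is the point of encoding the morphotactics in the machine.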
![Page 14: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/14.jpg)
• We can now run strings through these machines to recognize strings in the language
  • Accept words that are ok
  • Reject words that are not
• But is this enough?
  • We often want to know the structure of a word (understanding/parsing)
  • Or we may have a stem and want to produce a surface form (production/generation)
• Example
  • From “cats” to “cat +N +PL”
  • From “cat +N +PL” to “cats”
![Page 15: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/15.jpg)
Finite State Transducers (FSTs)
• Turning an FSA into an FST
  • Add another tape
  • Add extra symbols to the transitions
  • On one tape we read “cats” -- on the other we write “cat +N +PL”
  • Or vice versa…
![Page 16: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/16.jpg)
Koskenniemi 2-level Morphology
Kimmo Koskenniemi’s two-level morphology
Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography)
![Page 17: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/17.jpg)
• c:c means read a c on one tape and write a c on the other
• +N:ε means read a +N symbol on one tape and write nothing on the other
• +PL:s means read +PL and write an s
c:c a:a t:t +N:ε +PL:s
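The arc labels above can be sketched as a small transducer that runs in either direction. This is a hedged illustration: the state numbers, the helper names (`transduce`, `generate`, `analyze`) and the extra `+SG:ε` arc are assumptions added so that the singular can be handled as well.

```python
EPS = ""  # epsilon: nothing is read or written on that tape

# state -> list of (lexical symbol, surface symbol, next state)
ARCS = {
    0: [("c", "c", 1)],
    1: [("a", "a", 2)],
    2: [("t", "t", 3)],
    3: [("+N", EPS, 4)],
    4: [("+PL", "s", 5), ("+SG", EPS, 5)],  # +SG arc is an added assumption
}
FINAL = {5}

def transduce(symbols, read_side):
    """read_side=0: read the lexical tape, write surface; read_side=1: the reverse."""
    results = []

    def walk(state, pos, out):
        if state in FINAL and pos == len(symbols):
            results.append(out)
        for arc in ARCS.get(state, []):
            read, write, nxt = arc[read_side], arc[1 - read_side], arc[2]
            emitted = out + [write] if write != EPS else out
            if read == EPS:
                walk(nxt, pos, emitted)        # epsilon arc: consume no input
            elif pos < len(symbols) and symbols[pos] == read:
                walk(nxt, pos + 1, emitted)

    walk(0, 0, [])
    return results

def generate(lexical):
    """'cat +N +PL' -> list of surface strings."""
    stem, *tags = lexical.split()
    return ["".join(r) for r in transduce(list(stem) + tags, 0)]

def analyze(surface):
    """'cats' -> list of lexical analyses."""
    paths = transduce(list(surface), 1)
    return ["".join(s if len(s) == 1 else " " + s for s in p) for p in paths]

print(generate("cat +N +PL"))  # lexical -> surface
print(analyze("cats"))         # surface -> lexical
```

The same table of arcs serves both parsing and generation; only the choice of which side of each pair is read changes.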
![Page 18: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/18.jpg)
Not So Simple
• Of course, it’s not all as easy as
  • “cat +N +PL” <-> “cats”
• What do we do about geese, mice, oxen?
• Many spelling/pronunciation changes go along with inflectional changes, e.g.
  • Fox and Foxes
![Page 19: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/19.jpg)
Multi-Tape Machines
• Solution for complex changes:
  – Add more tapes
  – Use output of one tape machine as input to the next
• To handle irregular spelling changes, add intermediate tapes with intermediate symbols
![Page 20: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/20.jpg)
Example of a Multi-Tape Machine
• We use one machine to transduce between the lexical and the intermediate level, and another to transduce between the intermediate and the surface tapes
![Page 21: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/21.jpg)
FST Fragment: Lexical to Intermediate
• ^ is morpheme boundary; # is word boundary
![Page 22: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/22.jpg)
FST Fragment: Intermediate to Surface
• Rule: insert an e after a morpheme-final x, s or z and before morpheme s, e.g. fox^s# → foxes
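As a rough sketch, the e-insertion rule can be written as a single regular-expression rewrite followed by removal of the boundary symbols. The function name `intermediate_to_surface` is hypothetical, and the sketch covers only this one rule, not the full set of English spelling rules.

```python
import re

def intermediate_to_surface(form):
    """Map an intermediate form such as 'fox^s#' to its surface spelling."""
    # Replace the morpheme boundary with 'e' when it follows x, s or z
    # and is followed by the morpheme 's' at the end of the word.
    form = re.sub(r"(?<=[xsz])\^(?=s#)", "e", form)
    # Any remaining boundary symbols have no surface realization.
    return form.replace("^", "").replace("#", "")

for f in ["fox^s#", "cat^s#", "fox#"]:
    print(f, "->", intermediate_to_surface(f))
```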
![Page 23: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/23.jpg)
Putting Them Together
![Page 24: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/24.jpg)
Practical Uses
• This kind of parsing is normally called morphological analysis
• Can be
  • An important stand-alone component of an application (spelling correction, information retrieval, part-of-speech tagging,…)
  • Or simply a link in a chain of processing (machine translation, parsing,…)
![Page 25: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/25.jpg)
Porter Stemmer (1980)
• Standard, very popular and usable stemmer (IR, IE) – identify a word’s stem
• Sequence of cascaded rewrite rules, e.g.
  – IZE → ε (e.g. unionize → union)
  – CY → T (e.g. frequency → frequent)
  – ING → ε, if stem contains vowel (motoring → motor)
• Can be implemented as a lexicon-free FST (many implementations available on the web)
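The three example rules can be sketched as a cascade in which each rule sees the output of the previous one. This is not the Porter algorithm itself, which has many more rules and extra conditions on the stem; `RULES` and `toy_stem` are hypothetical names covering only the rules listed above.

```python
import re

def has_vowel(stem):
    return re.search(r"[aeiou]", stem) is not None

# (suffix to match, replacement, optional condition on the remaining stem)
RULES = [
    ("ize", "",  None),       # unionize  -> union
    ("cy",  "t", None),       # frequency -> frequent
    ("ing", "",  has_vowel),  # motoring  -> motor, but sing stays sing
]

def toy_stem(word):
    """Apply the rules in order; no lexicon is consulted."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if condition is None or condition(stem):
                word = stem + replacement
    return word

for w in ["unionize", "frequency", "motoring", "sing"]:
    print(w, "->", toy_stem(w))
```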
![Page 26: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/26.jpg)
Important Note: Morphology Differs by Language
• Languages differ in how they encode morphological information
  – Isolating languages (e.g. Cantonese) have no affixes: each word usually has 1 morpheme
  – Agglutinative languages (e.g. Finnish, Turkish) build words from prefixes and suffixes added to a stem (like beads on a string) – each feature realized by a single affix, e.g. Finnish
    epäjärjestelmällistyttämättömyydellänsäkäänköhän
    ‘Wonder if he can also ... with his capability of not causing things to be unsystematic’
![Page 27: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/27.jpg)
  – Polysynthetic languages (e.g. Inuit languages) express much of their syntax in their morphology, incorporating a verb’s arguments into the verb, e.g. Western Greenlandic
    Aliikusersuillammassuaanerartassagaluarpaalli.
    aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
    entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
    'However, they will say that he is a great entertainer, but ...'
  – So different languages may require very different morphological analyzers
![Page 28: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/28.jpg)
Concatenative vs. Non-concatenative Morphology
• Semitic root-and-pattern morphology
  – Root (2-4 consonants) conveys basic semantics (e.g. Arabic /ktb/)
  – Vowel pattern conveys voice and aspect
  – Derivational template (binyan) identifies word class
![Page 29: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/29.jpg)
| Template | Vowel pattern: active | Vowel pattern: passive | Gloss |
|---|---|---|---|
| CVCVC | katab | kutib | write |
| CVCCVC | kattab | kuttib | cause to write |
| CVVCVC | ka:tab | ku:tib | correspond |
| tVCVVCVC | taka:tab | tuku:tib | write each other |
| nCVVCVC | nka:tab | nku:tib | subscribe |
| CtVCVC | ktatab | ktutib | write |
| stVCCVC | staktab | stuktib | dictate |
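The table can be reproduced by a short sketch that fills the C slots of a template with root consonants and the V slots with a vowel melody. Several things here are assumptions read off the table rather than a linguistic analysis: the function name `interdigitate`, the melodies (all a for active; u throughout with a final i for passive), VV written as a long vowel with ":", and doubling the middle consonant when a template has one C slot more than the root has consonants.

```python
import re

def interdigitate(root, template, voice="active"):
    """Combine a consonantal root (e.g. 'ktb') with a CV template."""
    # Slots: a run of V (one vowel position, VV = long), a single C,
    # or a lowercase literal such as the t/n/s affix consonants.
    slots = re.findall(r"V+|C|[a-z]", template)

    consonants = list(root)
    n_c = slots.count("C")
    if n_c == len(consonants) + 1:
        consonants.insert(1, consonants[1])  # assumed: geminate middle consonant
    elif n_c != len(consonants):
        raise ValueError("template does not fit this root")

    n_v = sum(1 for s in slots if s.startswith("V"))
    if voice == "active":
        melody = ["a"] * n_v
    else:
        melody = ["u"] * (n_v - 1) + ["i"]

    c_iter, v_iter = iter(consonants), iter(melody)
    out = []
    for slot in slots:
        if slot == "C":
            out.append(next(c_iter))
        elif slot.startswith("V"):
            out.append(next(v_iter) + (":" if len(slot) > 1 else ""))
        else:
            out.append(slot)  # literal affix consonant
    return "".join(out)

for t in ["CVCVC", "CVCCVC", "CVVCVC", "tVCVVCVC", "nCVVCVC", "CtVCVC", "stVCCVC"]:
    print(t, interdigitate("ktb", t), interdigitate("ktb", t, "passive"))
```

Because root, template and vowel melody are interleaved rather than concatenated, this kind of morphology does not fit the stem-plus-affix arcs used for English earlier in the deck.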
![Page 30: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/30.jpg)
Morphological Representations: Evidence from Human Performance
• Hypotheses:
  – Full listing hypothesis: words listed
  – Minimum redundancy hypothesis: morphemes listed
• Experimental evidence:
  – Priming experiments (Does seeing/hearing one word facilitate recognition of another?) suggest something in between
• Regularly inflected forms (e.g. cars) prime the stem (car), but derived forms (e.g. management, manage) do not
![Page 31: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/31.jpg)
• But spoken derived words can prime stems if they are semantically close (e.g. government/govern but not department/depart)
• Speech errors suggest affixes must be represented separately in the mental lexicon
  – ‘easy enoughly’ for ‘easily enough’
• Importance of morphological family size
  – Larger families → faster recognition
![Page 32: Morphology: Words and their Parts](https://reader036.vdocument.in/reader036/viewer/2022081508/56814514550346895db1d6fa/html5/thumbnails/32.jpg)
Summing Up
• Regular expressions and FSAs can represent subsets of natural language as well as regular languages
  – Both representations may be difficult for humans to understand for any real subset of a language
  – Can be hard to scale up: e.g., when many choices at any point (e.g. surnames)
  – But quick, powerful and easy to use for small problems
  – AT&T Finite State Toolkit does scale
• Next class:
  – Read Ch 4 on Ngrams
  – HW1 will be due at midnight on Oct 1