CS11-737: Multilingual Natural Language Processing Yulia Tsvetkov Morphological Analysis and Inflection


Page 1: CS11-737: Multilingual Natural Language Processing

CS11-737: Multilingual Natural Language Processing

Yulia Tsvetkov

Morphological Analysis and Inflection

Page 2: CS11-737: Multilingual Natural Language Processing

What is a word

Bob’s handy man is a do-it-yourself kinda guy, isn’t he?

Page 3: CS11-737: Multilingual Natural Language Processing

Morphology

The study of the formation and internal structure of words

Page 4: CS11-737: Multilingual Natural Language Processing

Morpheme

Image from Lori Levin and David R. Mortensen’s draft book “Human Languages for Artificial Intelligence”

Page 5: CS11-737: Multilingual Natural Language Processing

Words are made of morphemes

Bob’s handy man is a do-it-yourself kinda guy, isn’t he?

free morpheme

bound morphemes

Example by Austin Matthews

Page 6: CS11-737: Multilingual Natural Language Processing

Morphological processes

● concatenation
● affixation = stem + affix
○ prefix
○ suffix
● non-concatenative affixation
○ infix
● compounding = stem + stem

stem; prefix + stem; prefix + stem + suffix = circumfixation

Page 7: CS11-737: Multilingual Natural Language Processing

Tagalog

● Tagalog
○ stem - bundok
○ singular - mabundok
○ plural - mabubundok
○ gloss - ‘mountainous’

Example from Lori Levin and David R. Mortensen’s draft book “Human Languages for Artificial Intelligence”
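The Tagalog forms above can be reproduced with a short sketch. It assumes that the plural is built by copying the stem's first consonant-vowel sequence (reduplication) before attaching the prefix ma-. That rule is inferred from the single example on this slide, and the function names are hypothetical.

```python
VOWELS = set("aeiou")

def first_cv(stem: str) -> str:
    """Return the stem's initial consonant(s) plus its first vowel."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return stem[: i + 1]
    return stem  # no vowel found: fall back to copying the whole stem

def ma_adjective(stem: str, plural: bool = False) -> str:
    """Prefix ma- to the stem; reduplicate the first CV for the plural (assumed rule)."""
    base = first_cv(stem) + stem if plural else stem
    return "ma" + base

print(ma_adjective("bundok"))               # mabundok
print(ma_adjective("bundok", plural=True))  # mabubundok
```

Reduplication copies material from the stem itself, so it cannot be described as attaching a fixed affix string; this is why it is usually treated separately from plain prefixation and suffixation.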

Page 8: CS11-737: Multilingual Natural Language Processing

Arabic, Chinese

● Arabic
○ root and pattern morphology

● Chinese
○ compound words

Page 9: CS11-737: Multilingual Natural Language Processing

Morphological functions

● Derivational morphemes
○ bound morphemes used to create new words
○ if these affixes are attached to a new base, the resulting combination yields a word with a new meaning
○ the derived word often belongs to a different syntactic class

● Inflectional morphemes
○ bound morphemes used to mark grammatical distinctions
○ change the form but not the POS tag or the key meaning of the word

Page 10: CS11-737: Multilingual Natural Language Processing

Interlinear glossed text (IGT)

● https://www.eva.mpg.de/lingua/resources/glossing-rules.php


Page 12: CS11-737: Multilingual Natural Language Processing

Types of morphological categories and functions

1. Nouns
a. NUMBER: Singular, Dual, Plural
b. GENDER (natural & grammatical): Masculine, Feminine, Neuter (Animate, Vegetable); AND AGREEMENT
c. DEFINITENESS: Definite, Indefinite
d. POSSESSION: 1st, 2nd, 3rd; Singular & Plural
e. NOUN CLASS (Grammatical gender): Declension types I, II, III, etc.
f. CASE PARADIGM (DECLENSION)

2. Adjectives
a. RELATIONAL : QUALITATIVE : DEFECTIVE
b. DEGREE: Comparative and Superlative

3. Verbs
a. TRANSITIVITY: Transitive, Intransitive
b. ASPECT: Perfective, Imperfective
c. TENSE: Distant Past, Past, Present, Future, Distant Future
d. VOICE: Active, Passive
e. MOOD: Indicative, Imperative, Subjunctive
f. Conjugation Class: I, II, III Conjugations and Conjugations: 1st, 2nd, 3rd Person, Sg, Pl Agreement

Page 13: CS11-737: Multilingual Natural Language Processing

Morphological typology

● Isolating or Analytic
○ Vietnamese, Chinese, English

● Synthetic
○ Fusional or Flexional
■ German, Greek, Russian
■ Templatic: Hebrew and Arabic
○ Agglutinative or Agglutinating
■ Finnish, Turkish, Malayalam, Swahili
○ Polysynthetic
■ Inuit, Yupik

Page 14: CS11-737: Multilingual Natural Language Processing

Type-token curves

(Cettolo, Girardi, & Federico, 2012)
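A type-token curve plots the number of distinct word forms (types) against the number of running words (tokens) read so far. The sketch below shows one way to compute such a curve. The function name, the `step` parameter, and the toy sentence are assumptions for illustration and are not taken from the cited work.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy whitespace-tokenized text; a real curve would use a large corpus.
tokens = "the cat sat on the mat and the dog sat too".split()
print(type_token_curve(tokens, step=5))  # [(5, 4), (10, 7)]
```

For a morphologically rich language the curve typically keeps climbing for longer, because each lemma surfaces in many inflected forms.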

Page 15: CS11-737: Multilingual Natural Language Processing

Why is rich morphology a challenge for NLP?

● High type-token ratio due to the large variety of grammatical features expressed with morphology

○ This leads to lexical sparsity and out-of-vocabulary words

● In language generation, long-range relations between words need to be enforced for modeling morphological agreement

○ This leads to agreement errors

● Morphological properties vary across languages and language families, and mapping of morphological features across languages is a challenge

○ This is exacerbated by the variability of morphological rules and irregularities (e.g. dance → danced → danced but eat → ate → eaten)

○ This leads to problems in transfer learning, translation errors, and biases in translation

Page 16: CS11-737: Multilingual Natural Language Processing

Types of morphological processing

● Analysis
○ morphological parsing
○ morphological segmentation

● Generation
○ inflection generation
○ paradigm completion

● Acquisition of inflectional morphology

Page 17: CS11-737: Multilingual Natural Language Processing

Morphological analysis

Page 18: CS11-737: Multilingual Natural Language Processing

Morphological analysis with FSTs
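As a rough illustration of the idea, the sketch below implements a toy transducer in plain Python rather than with a real FST toolkit. The lexicon, the tag strings, and the state names are invented for this example, and each arc consumes a whole string instead of a single symbol to keep the table small.

```python
from typing import Dict, List, Tuple

# state -> list of (input string, output string, next state); "" is an epsilon input
Arc = Tuple[str, str, str]

TOY_FST: Dict[str, List[Arc]] = {
    "start": [("cat", "cat", "noun"), ("dog", "dog", "noun"), ("walk", "walk", "verb")],
    "noun": [("", "+N+SG", "end"), ("s", "+N+PL", "end")],
    "verb": [("", "+V", "end"), ("ed", "+V+PST", "end"), ("s", "+V+3SG", "end")],
}
FINAL = {"end"}

def analyze(surface: str, state: str = "start") -> List[str]:
    """Return every analysis the toy transducer accepts for `surface`."""
    if state in FINAL:
        # Accept only if the whole surface string has been consumed.
        return [""] if surface == "" else []
    analyses = []
    for inp, out, nxt in TOY_FST.get(state, []):
        if surface.startswith(inp):
            for rest in analyze(surface[len(inp):], nxt):
                analyses.append(out + rest)
    return analyses

print(analyze("cats"))    # ['cat+N+PL']
print(analyze("walked"))  # ['walk+V+PST']
print(analyze("bird"))    # [] (not in the toy lexicon)
```

The empty result for "bird" shows the usual limitation of lexicon-based analyzers: words outside the lexicon receive no analysis.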

Page 19: CS11-737: Multilingual Natural Language Processing

Morphological analysis with RNNs

Canonical segmentation

a. surface segmentation: achievability → achiev+abil+ity

b. canonical segmentation: achievability → achieve+able+ity

1. Character bidirectional GRU encoder with attention
2. GRU decoder produces output characters
3. Neural reranker for segments to identify canonical segments
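The difference between the two segmentation types can be checked mechanically: surface segments concatenate back to the original word, while canonical segments restore underlying forms and generally do not. The helper below is a hypothetical illustration of that property, not part of the neural model described above.

```python
def is_surface_segmentation(word, segments):
    """True if the segments concatenate back to exactly the original word."""
    return "".join(segments) == word

word = "achievability"
surface = ["achiev", "abil", "ity"]
canonical = ["achieve", "able", "ity"]

print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False
```

Because the canonical output is not a substring split of the input, it cannot be produced by boundary tagging alone, which is one motivation for a character-level encoder-decoder.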

Page 20: CS11-737: Multilingual Natural Language Processing

Evaluation of morphological analysis

● Error rate
● Edit distance
● Morpheme F1
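The three metrics can be sketched as follows. The definitions here are assumptions, since the slide does not spell them out: error rate is the fraction of words whose predicted analysis differs from the gold analysis, edit distance is Levenshtein distance over characters, and morpheme F1 is computed from the multiset overlap between predicted and gold morphemes.

```python
from collections import Counter

def error_rate(preds, golds):
    """Fraction of items whose prediction is not an exact match."""
    wrong = sum(p != g for p, g in zip(preds, golds))
    return wrong / len(golds)

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def morpheme_f1(pred, gold):
    """F1 over morphemes, counting each morpheme as often as it occurs."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

pred = ["achiev", "abil", "ity"]
gold = ["achieve", "able", "ity"]
print(error_rate(["+".join(pred)], ["+".join(gold)]))
print(edit_distance("+".join(pred), "+".join(gold)))
print(morpheme_f1(pred, gold))
```

Error rate gives no credit for near misses, while edit distance and morpheme F1 both reward partially correct segmentations.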

Page 21: CS11-737: Multilingual Natural Language Processing

UniMorph

https://unimorph.github.io
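UniMorph data is distributed as tab-separated triples of lemma, inflected form, and a semicolon-separated feature bundle. The sketch below groups such lines into paradigms. The sample lines are illustrative and were not copied from the actual data files, and the function name is hypothetical.

```python
from collections import defaultdict

# Illustrative lines in the lemma<TAB>form<TAB>features layout.
SAMPLE = (
    "eat\tate\tV;PST\n"
    "eat\teaten\tV;V.PTCP;PST\n"
    "dance\tdanced\tV;PST\n"
)

def read_unimorph(text):
    """Map each lemma to a dict from feature tuple to inflected form."""
    paradigms = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        lemma, form, feats = line.split("\t")
        paradigms[lemma][tuple(feats.split(";"))] = form
    return paradigms

paradigms = read_unimorph(SAMPLE)
print(paradigms["eat"][("V", "PST")])  # ate
print(sorted(paradigms))               # ['dance', 'eat']
```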

Page 22: CS11-737: Multilingual Natural Language Processing

Morphological generation

1. Inflection generation
2. Paradigm completion

Page 23: CS11-737: Multilingual Natural Language Processing

The SIGMORPHON shared tasks

● Cross-lingual transfer for morphological inflection
● Morphological analysis in context
● Morphological paradigm completion

Page 24: CS11-737: Multilingual Natural Language Processing

Morphological inflection generation
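To make the task concrete, the sketch below shows a simple suffix-rule baseline. It is not the approach presented in the lecture. It learns, for each tag, which suffix change turns a lemma into its inflected form and applies the most frequent matching rule. The training triples and all function names are assumptions for illustration.

```python
import os
from collections import Counter, defaultdict

def suffix_rule(lemma, form):
    """Strip the longest common prefix; what remains is (old suffix, new suffix)."""
    k = len(os.path.commonprefix([lemma, form]))
    return lemma[k:], form[k:]

def train(triples):
    """Count suffix rules per tag from (lemma, form, tag) triples."""
    rules = defaultdict(Counter)
    for lemma, form, tag in triples:
        rules[tag][suffix_rule(lemma, form)] += 1
    return rules

def inflect(lemma, tag, rules):
    """Apply the most frequent rule for `tag` whose old suffix matches the lemma."""
    for (old, new), _ in rules[tag].most_common():
        if lemma.endswith(old):
            return lemma[: len(lemma) - len(old)] + new
    return lemma  # no rule known for this tag: return the lemma unchanged

rules = train([
    ("walk", "walked", "V;PST"),
    ("jump", "jumped", "V;PST"),
    ("dance", "danced", "V;PST"),
])
print(inflect("talk", "V;PST", rules))  # talked
print(inflect("eat", "V;PST", rules))   # eated (irregular verbs defeat the rule)
```

The wrong output for "eat" illustrates why irregular forms and sparse training data make inflection generation harder than applying a single majority rule.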

Page 25: CS11-737: Multilingual Natural Language Processing

Paper for class discussion

● https://www.aclweb.org/anthology/D19-1091.pdf
● Read the paper
● Provide a critique of one part of the paper (e.g., focusing on an individual component of the proposed model architecture or a part of the experimental setup)
● Propose directions for follow-up work