
The Syntax-Morphology Interface and Natural Language Processing

Veronika Vincze

University of Szeged

Hungary

vinczev@inf.u-szeged.hu

Thematic Training Course on Processing Morphologically Rich Languages

11-15 April 2011


Outline

• Introduction

• Syntax vs. morphology from a linguistic viewpoint

• Morphological coding systems in Hungarian

• Morphosyntactic information in Hungarian corpora

• Language-specific morphosyntactic problems

• Effects on IE, NER and MT


Syntax vs. morphology

• Typological differences among languages

• Agglutinative languages: role of morphology is stronger (a lot of information in morphemes)

• Isolating languages: role of syntax is stronger (fewer morphemes, more constructions)

• Focus on Hungarian (agglutinative) and English (fusional/isolating)


Basic Hungarian syntax

• A lot of information encoded in morphemes

• No fixed word order

• Information structure is reflected in word order (theme-rheme, old-new)

Péter szereti Marit.
Peter love-3SgObj Mary-ACC
‘Peter loves Mary.’

Péter Marit szereti. ‘It is Mary who Peter loves.’
Marit szereti Péter. ‘It is Mary who Peter loves.’
Marit Péter szereti. ‘It is Peter who loves Mary.’
Szereti Péter Marit. ‘Peter LOVES Mary (and not hates).’
Szereti Marit Péter. ‘Peter LOVES Mary (and not hates).’
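As a rough illustration of why word order does not determine grammatical roles in these examples, the following minimal sketch reads subject and object off case marking alone. The token dictionaries and the function name `roles` are assumptions made for this illustration; they are not the output format of any tool discussed in the course.

```python
from itertools import permutations

# Hypothetical, hand-annotated tokens for "Péter szereti Marit."
TOKENS = [
    {"form": "Péter", "pos": "N", "case": "NOM"},
    {"form": "szereti", "pos": "V", "case": None},
    {"form": "Marit", "pos": "N", "case": "ACC"},
]

def roles(tokens):
    """Read grammatical roles off case marking, ignoring token position."""
    result = {}
    for tok in tokens:
        if tok["pos"] == "V":
            result["verb"] = tok["form"]
        elif tok["case"] == "NOM":
            result["subject"] = tok["form"]
        elif tok["case"] == "ACC":
            result["object"] = tok["form"]
    return result

# All six orderings of the three words receive the same role assignment.
analyses = [roles(order) for order in permutations(TOKENS)]
```

The sketch deliberately ignores information structure (focus, topic), which is exactly what the different orders express, and it assumes that every nominative noun is a subject, which does not always hold.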


Morphosyntactic features of Hungarian

• Nominal declension (nouns, adjectives, numerals)

• Verbal conjugation

• Several hundred word forms for each lemma

• Grammatical relations encoded primarily by morphemes -> morpho + syntactic


Nominal suffixes

A stem can be extended by:

• Derivational suffixes

• Plural

• Possessive

• Case suffixes

hat-ás-a-i-nak ‘to its effects’
stem-DERIV.SUFF-POSS-POSS.PL-DAT

egész-ség-ed-re ‘cheers’
stem-DERIV.SUFF-POSS.Sg2-SUB


Case suffixes in Hungarian

• ~20 cases (“rare” cases are not always counted: distributive-temporal (-nte), associative (-stul/-stül…))

• always at the right end of the word form

• grammatical relations are encoded:

– Arguments of the verb

– Adjuncts (temporal and locative adverbials)


…and in English

Pisti szerdánként edzésre jár.

Steve Wednesday-DIST-TEMP training-SUB go-3Sg

Each Wednesday Steve goes to training.

Szerdánként – each Wednesday

Edzésre – to training


Pisti bort iszik.

Steve wine-ACC drink-3Sg

Steve is drinking wine.

Pisti-NOM – Steve – subject

Bort – wine – object


Possessive in Hungarian

• A fiú kutyája
The boy dog-POSS
‘The boy’s dog’

• A(z ő) kutyája
The (he) dog-POSS
‘His dog’

• Possessor in nominative

• Possessed with a possessive marker

• A fiúnak a kutyája
The boy-DAT the dog-POSS

• Possessor in dative

• Possessed with a possessive marker


…and in English

• The boy’s dog

• His dog

• Possessor with a possessive marker (pronoun)

• Possessed with no marker

• The dog of the boy

• Possessive relation is marked by a preposition


Hungarian vs. English - nouns

• Number of word forms: several hundred (HU) vs. 2-3 (EN)

• Means to express grammatical relations:

– Suffixes (HU)

– Preposition, fixed position (word order), suffix, determiner (EN)

• Methods for morphological parsing are very different for Hungarian and English


Verbal suffixes

A stem can be extended by:

• Derivational suffixes

• Mood markers

• Tense markers

• Person/number suffixes

• Objective markers

vág-at-ná-k
cut-CAUS-COND-3PlObj
‘they would have it cut’


Mood and tense in Hungarian

• Mood:

– Indicative: default (not marked)

– Conditional: suffixes (present); analytic form (past)

– Imperative: suffixes

• Tense:

– Present: default (not marked)

– Past: suffixes

– Future: analytic (auxiliary fog)


…and in English

• Mood:

– Indicative: default (not marked)

– Conditional: past tense forms + analytic forms (auxiliary would)

– Imperative: auxiliaries + grammatical structure

• Tense:

– Present: default (not marked)

– Past: suffix / irregular forms (suppletives or ablaut (vowel change))

– Future: analytic (auxiliary will)


Person & Number

• Hungarian: suffixes

– Fut-ok

– Fut-sz

– Fut

– Fut-unk

– Fut-tok

– Fut-nak

• 3Sg is the default (not marked!)

• English: 3Sg + pronouns / obligatory subject

– I run

– You run

– He runs

– We run

– You run

– They run

• 3Sg marked!


Derivational suffixes in Hungarian

• Possibility/permission:

fut-hat-ok

run-MOD-1Sg

‘I may run’

• Reflexive:

mos-akod-unk

wash-REFL-1Pl

‘we wash ourselves’

• Frequentative:

üt-öget-sz

hit-FREQ-2Sg

‘you hit sg repeatedly’

• Causative:

csinál-tat-nak

do-CAUS-3Pl

‘they have sg done’


… and in English

• Possibility/permission: auxiliaries

• Reflexive: pronominal objects

• Frequentative: adverb

• Causative: construction


Hungarian vs. English - verbs

• Number of word forms: several hundred (HU) vs. 4-5 (EN)

• Means to express grammatical relations:

– Suffixes + auxiliaries (HU)

– Auxiliaries + reflexive pronouns + constructions (EN)

• A lot of syntactic information is encoded in Hungarian morphemes


Morphology     | Syntax                 | English
Nominal suffix | verb-argument relation | word order, preposition
               | possessive             | suffix, preposition
Verbal suffix  | tense                  | suffix
               | agreement              | pronoun, suffix
               | modality               | auxiliary
               | causation              | construction
               | aspect                 | construction
               | reflexivity            | pronoun


Morphosyntactic coding systems

• Language independent (?)

• Language dependent

• (dis)advantages:

– comparability

– considering language-specific features

– complexity

• Different information is necessary for each language


Hungarian coding systems

• HUMOR

– recall Thursday Session 1

– in the Hungarian National Corpus

• MSD

– In Szeged Treebank

– Parser and POS-tagger available at: http://www.inf.u-szeged.hu/rgai/magyarlanc

• KR

– No database

– Parser and POS-tagger available at: http://mokk.bme.hu/resources/hunmorph/index_html and http://code.google.com/p/hunpos/


MSD

• Morphosyntactic Description

• International coding system:

– English

– Romanian

– Slovenian

– Czech

– Bulgarian

– Estonian

– Hungarian


MSD - 2

• Positional codes

• A given position encodes a given type of information

• Position 0: part-of-speech

• Position 1: (sub)type within POS

• Further positions: other grammatical information (person, number, case, etc.)

• Irrelevant positions are marked with a hyphen (-)
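The positional scheme can be sketched as a simple lookup from string index to attribute name. The attribute names and slot layouts below are assumptions inferred from Hungarian example codes such as `Nc-sa---s1` and `Vmip1s---n`; they are meant as an illustration, not as the official MSD specification.

```python
# Assumed layouts (position -> attribute name), inferred from example codes.
# Slots whose meaning is not evident from the examples get placeholder names.
NOUN_LAYOUT = ["pos", "type", "gender", "number", "case",
               "slot5", "slot6", "slot7", "owner_number", "owner_person"]
VERB_LAYOUT = ["pos", "type", "mood", "tense", "person", "number",
               "slot6", "slot7", "slot8", "definiteness"]
LAYOUTS = {"N": NOUN_LAYOUT, "V": VERB_LAYOUT}

def decode_msd(code):
    """Map each informative position of an MSD code to its attribute name."""
    layout = LAYOUTS[code[0]]  # position 0 selects the layout (part of speech)
    # Hyphens mark irrelevant positions and are skipped.
    return {name: ch for name, ch in zip(layout, code) if ch != "-"}

noun = decode_msd("Nc-sa---s1")
verb = decode_msd("Vmip1s---n")
```

Under these assumptions, `noun` holds a common noun in the singular accusative with a first person singular owner, and `verb` a main verb in indicative present, first person singular, with the last slot set to `n`.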


KR

• Created for Hungarian

• Hierarchical attribute-value matrices

• Default values (3Sg, singular…)

• Derivational information is encoded

• Compounds are also segmented


MSD vs. KR

• Differences between the two systems:

– derivation

– compounds

• Harmonization efforts aim at a morphological parser (magyarlanc) whose output is fully compatible with the Szeged Treebank (Farkas et al. 2010)


Nouns in MSD

Word form      | Lemma    | MSD code   | Translation
kutya          | kutya    | Nc-sn      | ‘dog’
kutyámat       | kutya    | Nc-sa---s1 | ‘my dog-ACC’
kutyaházaikról | kutyaház | Nc-ph---p3 | ‘about their doghouses’
Obamához       | Obama    | Np-st      | ‘to Obama’


Verbs in MSD

Word form     | Lemma  | MSD code   | Translation
futok         | fut    | Vmip1s---n | ‘I run’
futhatsz      | fut    | Voip2s---n | ‘you can run’
ütögették     | üt     | Vfis3p---y | ‘they were hitting it’
csináltattunk | csinál | Vsis1p---n | ‘we had sg made’


Morphosyntactically annotated Hungarian corpora

• Hungarian National Corpus

– 100-million-word balanced reference corpus of present-day Hungarian

– Word forms automatically annotated for stem, part of speech and inflectional information

– http://corpus.nytud.hu/mnsz/index_eng.html

• Szeged Treebank

– 1 million words, 82K sentences

– Manually annotated for lemma, POS-tags

– Constituency and dependency trees

– http://www.inf.u-szeged.hu/rgai/nlp


Szeged Treebank

• Manually annotated treebank for Hungarian

– Covers various linguistic styles

• literature, newspapers, laws, student essays, computer books, etc.

• multilingual connection: Orwell’s 1984; Win2000 manual in Hungarian

– Available free of charge for research

• Developed by

– University of Szeged, HLT group

– MorphoLogic Ltd.

– Academy of Sciences, Research Institute for Linguistics


Szeged Treebank 2.

• TEI XML format

• Manually annotated

– sentence split & word segmentation

– morphological analysis

– PTB-style syntactic structure

– Verb argument structure

– converted / extended to Dependency Grammar format manually


Szeged Treebank 3.

• Several versions

• Constituency and dependency versions

• Old MSD codes

• New (harmonized) MSD codes

• (dependency) parser under development

• Being extended with folklore texts


Dependency vs. constituency

• Each node corresponds to a word -> no virtual nodes (CP, I’…) in dependency trees

• Constituency grammars are said to be good for languages with fixed word order

• Syntactic relations are determined

– by the position in the tree (constituency grammar)

– by dependency relations (labeled edges) (dependency)


Constituency trees in SzT2.0

• Based on generative syntax (É. Kiss et al. 1999)

• Syntactic features of Hungarian also considered (i.e. not hardcore Chomskyan trees)

• Verb-argument relations are encoded by labels

• Very detailed information: different grammatical role for each case suffix

• Semantic information can also be found (temporal and locative adverbials)


Aggie all relative-POSS-ACC the day before yesterday see-PAST-3Sg-Obj guest-ESS

‘Aggie received all of her relatives the day before yesterday.’


Dependency trees in Szeged Dependency Treebank

• Based on SzT2.0

• Automatic conversion and manual correction

• Word forms are the nodes of the tree

• Simplified relations for nominal arguments: SUBJ, OBJ, DAT, OBL, ATT

• Semantic information kept

• Sentences without 3Sg copula are distinctively marked


Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions.


Virtual nodes

• No overt copula in present tense 3Sg

• Only subject and predicative noun/adjective manifest

• No syntactic structure in SzT (grammatical roles are not marked)

• Virtual nodes in SzDT


I like to go to school because it is good to be at school though not always.


Szeged Treebank vs. Szeged Dependency Treebank

• Labeled relations in both cases -> the contrast is not so sharp

• Virtual nodes in SzDT -> grammatical structure marked for every sentence (IE, MT)

• No word order constraints in SzDT

• Word forms are marked

• Other possibilities: morpheme-based syntax (Prószéky et al. (1989), Koutny, Wacha (1991))


Language-specific morphosyntactic problems

• Morphology vs. syntax:

– Pseudo-subjects

– Pseudo-objects

– Pseudo-datives

• Morphological analysis of unknown words

• Lemmatization of named entities


Pseudo-subjects

• A noun in the nominative that is not the subject of the sentence -> special attention required when parsing

• Possessor:
a kisfiú labdája
the boy ball-3SgPOSS
‘the boy’s ball’

• Predicative noun:
István juhász maradt.
Stephen shepherd remain-PAST
‘Stephen remained a shepherd.’

• Object:
A kutyám kergeti a macska.
The dog-POSS chase-3SgObj the cat
‘The cat is chasing my dog.’ (garden path sentence)

A fiam szereti a lányod.
The son-1SgPOSS love-3SgObj the daughter-2SgPOSS
‘My son loves your daughter’ or ‘Your daughter loves my son’


Solutions

• Possessor:

– SzT: one NP includes the possessor and the possessed ((a kisfiú) labdája)

– SzDT: ATT relation

• Predicative noun: PRED relation

– Virtual node in SzDT

• Object: OBJ relation

– Sometimes contextual information is needed even for humans…


Pseudo-objects

Adverbials with an apparently accusative ending:

Futottam egy jót.

Run-PAST-1Sg a good-ACC

I have had a good run.

Nagyot aludtam.

Big-ACC sleep-PAST-1Sg

I have slept a lot.

Intransitive verbs -> cannot be an object -> MODE relation


Pseudo-datives

Not all (semantic) subjects are in nominative:

• Dative subject:
Sándornak kell elrendeznie az ügyeket.
Alexander-DAT must arrange-INF-3Sg the issue-PL-ACC
‘Alexander has to arrange the issues.’

• DAT in both corpora

• Certain auxiliaries with dative subjects (exceptions)

• Dative-nominative parallelism in possessive as well


Unknown words

• Unknown words can be:

– Compounds

– Named entities

– Derivations

• fémkapunk

• félmillió

• csokinyúl

• NATO-hoz

• Methods for analysis (Zsibrita et al. 2010):

– Segmentation into two or more analyzable parts

– Expert rules to filter impossible combinations (*V+N)

– Analysis of the last part goes to the whole word

– Substitution for hyphenated words (pre-defined patterns for each morphological class)


félmillió

fél+millió    Mc-snl

fél       N     half
          ADJ   half
          NUM   half
          V     be afraid
millió    NUM   million

Expert rules:
NUM + NUM
* non-NUM + NUM
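The segmentation and filtering steps for félmillió can be sketched roughly as follows. The toy lexicon and the single allowed rule are copied from the slide; the function name `analyze_compound` and the data layout are assumptions for this illustration and do not reproduce the implementation described in Zsibrita et al. (2010).

```python
# Toy lexicon from the slide: each known part with its possible POS tags.
LEXICON = {
    "fél": ["N", "ADJ", "NUM", "V"],
    "millió": ["NUM"],
}

# Expert rule from the slide: NUM + NUM is licensed, non-NUM + NUM is not.
ALLOWED = {("NUM", "NUM")}

def analyze_compound(word):
    """Return (first, second, tag_of_whole) for every licensed two-part split."""
    analyses = []
    for i in range(1, len(word)):
        first, second = word[:i], word[i:]
        for t1 in LEXICON.get(first, []):
            for t2 in LEXICON.get(second, []):
                if (t1, t2) in ALLOWED:
                    # The analysis of the last part goes to the whole word.
                    analyses.append((first, second, t2))
    return analyses

result = analyze_compound("félmillió")
```

Of the four readings of fél, only the numeral survives the rule filter, so the whole word is analyzed as a numeral.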


fémkapunk

fém+kap+unk    Vmip1p---n
fém+kapu+nk    Nc-sn---p1

fém     N   metal
kap     V   get
kapu    N   gate
unk     S   1Pl (verb)
nk      S   1PlPoss (noun)

Expert rules:
N + N
N-nonNOM + V
* N-NOM + V


csokinyúl

csoki+nyúl        Vmip3s---n
                  Nc-sn
cso+kinyúl (?)    Vmip3s---n

csoki     N   chocolate
nyúl      N   rabbit
          V   stretch
kinyúl    V   stretch out

Expert rules:
N + N
N-nonNOM + V
* N-NOM + V


NATO-hoz

NATO-hoz
NATO: V
Vmip3s---n

NATO-hoz (kalaphoz)
NATO: N
Np-st

Ordering of rules:
1. substitution
2. segmentation

NATO    ?   NATO
hoz     V   bring
        S   to

Expert rules:
N + - + S
N-nonNOM + - + V
* N-NOM + - + V
V + - + V

Substitution:
NATO- -> kalap ‘hat’
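The substitution step can be illustrated with a small sketch: the unknown part before the hyphen is replaced by a known pattern word, the resulting form is looked up, and the case found there is carried back to the original token. The lookup table, the function name `analyze_hyphenated`, and the decision to tag the result as a singular proper noun are all assumptions for this illustration.

```python
PATTERN_WORD = "kalap"  # pre-defined pattern noun from the slide ('hat')

# Hypothetical analyzer output for inflected forms of the pattern word:
# surface form -> MSD case letter (only the entry needed here).
PATTERN_ANALYSES = {"kalaphoz": "t"}

def analyze_hyphenated(token):
    """Analyze 'STEM-suffix' by substituting the pattern word for STEM."""
    stem, _, suffix = token.partition("-")
    if not suffix:
        return None  # not a hyphenated form
    case = PATTERN_ANALYSES.get(PATTERN_WORD + suffix)
    if case is None:
        return None  # substitution failed; segmentation would be tried next
    # Assumption: the unknown stem is treated as a singular proper noun.
    return {"lemma": stem, "msd": "Np-s" + case}

analysis = analyze_hyphenated("NATO-hoz")
```

Because substitution is tried before segmentation, the nominal reading is found without ever considering hoz as the verb ‘bring’.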


Lemmatization

• Lemmatization (i.e. dividing the word form into its root and affixes) is not a trivial task in morphologically rich languages such as Hungarian

• common nouns: relying on a good dictionary

• NEs: cannot be listed

• Problem: the NE ends in an apparent suffix


Lemmatization of NEs

Each ending that seems to be a possible suffix is cut off the NE in a step-by-step fashion:

Citroenben
Citroenben (lemma)
Citroen + ben ‘in (a) Citroen’
Citroenb + en ‘on (a) Citroenb’
Citroenbe + n ‘on (a) Citroenbe’

• Each possible lemma undergoes a Google and a Yahoo search; the most frequent one is chosen (Farkas et al. 2008)
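A minimal sketch of this candidate generation and frequency-based choice is given below. The suffix list covers only the slide example, and the frequency function is a stand-in with made-up counts; the actual method of Farkas et al. (2008) queries web search engines, which is not reproduced here. All names are hypothetical.

```python
SUFFIXES = ["ben", "en", "n"]  # the apparent suffixes from the slide example

def candidate_lemmas(word, suffixes=SUFFIXES):
    """The word itself plus every form obtained by cutting off a listed suffix."""
    candidates = [word]
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf):
            candidates.append(word[: -len(suf)])
    return candidates

def pick_lemma(word, frequency):
    """Choose the most frequent candidate (ties go to the earlier candidate)."""
    return max(candidate_lemmas(word), key=frequency)

# Made-up counts standing in for search engine hit counts.
FAKE_COUNTS = {"Citroen": 1000, "Citroenben": 50, "Citroenb": 0, "Citroenbe": 3}
lemma = pick_lemma("Citroenben", lambda w: FAKE_COUNTS.get(w, 0))
```

Passing the frequency source as a function keeps the candidate generation independent of where the counts come from.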


NLP applications

• NER

– NEs with suffixes

• Information extraction

– Modality, uncertainty

– Causation

• Machine translation

– Morphemes vs. structures


Named Entities

• NEs should be recognized

• They should be morphosyntactically tagged -> proper syntactic/semantic analysis

A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.

• Mini dictionary + suffix list + semantic frame


a DET the

ben S in

Citroenben ?

en S on

meghatározó ADJ dominant

n S on

ot S ACC

Peugeot ?

szerez V acquire

t S ACC

tulajdonrész N interest


Possible analyses

• Citroenben

Citroenben

Citroen + ben ‘Citroen-INE’

Citroenb + en ‘Citroenb-SUP’

Citroenbe + n ‘Citroenbe-SUP’

• Peugeot

Peugeot

Peugeo + t ‘Peugeo-ACC’

Peuge + ot ‘Peuge-ACC’


A semantic frame

<event frame=transaction.ownerchange>
[1=V("szerez"|"vásárol"|"vesz"|"megvesz"|"megvásárol"|"felvásárol")+subject=2+direct_object=3]
<rv role=buyer>[2=N]</rv>
[3=N("részesedés"|"tulajdon"|"tulajdonrész"|"rész"|"tulajdonhányad")+compl1=4+modified_by_adj=5]
<rv role=product>[4=N+case=ine+ceg]</rv>
<rv role=newshare>[5=A+measure+modified_by_number=6] [6=NB]</rv>
</event>


Analysis

A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.

Tulajdonhányadot -> ACC/OBJ (3)

Citroenben -> INE (4)

Peugeot -> NOM/SUBJ (2)

‘Peugeot acquires a dominant interest in Citroen.’
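How a frame can select among competing morphological analyses may be sketched as a search over combinations. The frame is reduced here to a bare case pattern (one nominative buyer, one accusative share, one inessive company), and unsuffixed readings are assumed to be nominative; both simplifications, as well as all names, are assumptions for this illustration.

```python
from itertools import product

# Candidate (lemma, case) analyses per token, following the slides.
CANDIDATES = {
    "Citroenben": [("Citroenben", "NOM"), ("Citroen", "INE"),
                   ("Citroenb", "SUP"), ("Citroenbe", "SUP")],
    "Peugeot": [("Peugeot", "NOM"), ("Peugeo", "ACC"), ("Peuge", "ACC")],
    "tulajdonhányadot": [("tulajdonhányad", "ACC")],
}

def fits_frame(cases):
    """Simplified frame: exactly one NOM, one ACC and one INE argument."""
    return sorted(cases) == ["ACC", "INE", "NOM"]

def consistent_readings():
    """Enumerate all combinations of analyses and keep those fitting the frame."""
    tokens = list(CANDIDATES)
    readings = []
    for combo in product(*(CANDIDATES[t] for t in tokens)):
        if fits_frame([case for _, case in combo]):
            readings.append(dict(zip(tokens, combo)))
    return readings

readings = consistent_readings()
```

Out of the twelve combinations only one fits the simplified frame: Citroen in the inessive and Peugeot in the nominative.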


Uncertainty

• Text Mining:

– derive facts from free text

– uncertainty and negation have an impact on the quality/nature of the information extracted

• applications have to treat sentences / clauses containing uncertain or negated information differently from factual information

• Uncertainty: possible existence of a thing (neither its existence nor its non-existence is claimed)


Uncertainty detection

• Uncertainty detection in English: cues (words with uncertain content)

• One typical means to express uncertainty in Hungarian: -hat/-het

High school grades may influence health.

A középiskolai jegyek kihathatnak az egészségre.

• Morphological analysis should reflect modality (Voip3s---n)
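If the morphological code reflects modality, a first-pass uncertainty detector can simply inspect the verb subtype position. The sketch below assumes MSD-tagged input as (word form, code) pairs and treats the subtype letter `o` as the marker of the -hat/-het forms, as in the verb examples of the MSD section; the function names are hypothetical.

```python
def is_modal(msd):
    """True for verb codes whose subtype letter (position 1) is 'o'."""
    return len(msd) > 1 and msd[0] == "V" and msd[1] == "o"

def flag_uncertain(tagged_tokens):
    """Return the word forms whose MSD code marks possibility/permission."""
    return [form for form, msd in tagged_tokens if is_modal(msd)]

tagged = [("futok", "Vmip1s---n"), ("futhatsz", "Voip2s---n")]
flagged = flag_uncertain(tagged)
```

Such a check only finds morphologically marked modality; lexical cues would still need a separate cue list.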


Causation

• Semantic/thematic relations to be determined properly

• AGENT != SUBJECT

Varrattam egy ruhát.
sew-CAUS-PAST-1Sg a dress-ACC
‘I had a dress sewn.’

Varrattam Marival egy ruhát.
sew-CAUS-PAST-1Sg Mari-INS a dress-ACC
‘I had Mary sew a dress.’

Varrtam Marival egy ruhát.
sew-PAST-1Sg Mari-INS a dress-ACC
‘I sewed a dress with Mary.’

• Causative information should be encoded (Vsip3s---n)


Argument structure of causative verbs

Sentence                     | Agent                | Beneficiary | Patient
Varrattam egy ruhát.         | ?                    | I (NOM)     | ruha (ACC)
Varrattam Marival egy ruhát. | Mari (INS)           | I (NOM)     | ruha (ACC)
Varrtam Marival egy ruhát.   | I (NOM) + Mari (INS) | ?           | ruha (ACC)
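The three rows above can be reproduced by a small mapping from case-marked arguments to thematic roles that depends on whether the verb is causative. The function name and the dictionary representation are assumptions for this illustration; an empty agent list and a `None` beneficiary stand for the question marks in the table.

```python
def thematic_roles(args, causative):
    """Map case-marked arguments (keys NOM, INS, ACC) to thematic roles."""
    if causative:
        # The instrumental participant does the work; the subject benefits.
        agent = [args["INS"]] if "INS" in args else []
        beneficiary = args.get("NOM")
    else:
        # Comitative reading: subject and instrumental participant act together.
        agent = [args[c] for c in ("NOM", "INS") if c in args]
        beneficiary = None
    return {"agent": agent, "beneficiary": beneficiary, "patient": args.get("ACC")}

row1 = thematic_roles({"NOM": "I", "ACC": "ruha"}, causative=True)
row2 = thematic_roles({"NOM": "I", "INS": "Mari", "ACC": "ruha"}, causative=True)
row3 = thematic_roles({"NOM": "I", "INS": "Mari", "ACC": "ruha"}, causative=False)
```

The causative flag would come from the morphological code of the verb, which is why that information needs to be encoded there.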


Machine translation

• Morpheme-based translation would be ideal

• Easier alignment of translational units

• Good morphological parser needed

• Easier to execute in dependency grammar

• Morpheme-based dependency structures


Alignments

at
|
varr
|
t
|
ruha

have
|
sewn
|
dress

ban
|
ház
|
am

in
|
house
|
my


Problems

• Not practical: no corpus available at the moment

• Portmanteau morphs – alignment problems

• Zero morphs – how many of them?

• 3 zero morphs in Hungarian nouns:

könyv-Ø-Ø-Ø vs. könyveit
book-Ø-Ø-Ø     book-POSS-POSS.PL-ACC

• (Mel’cuk 2006)


• Morphosyntactic codes might help

• Csinálhattátok Vois2p---y

• Reordering rules

MSD position | Morpheme | English
V            | csinál   | do
o            | hat      | can
i            | -        | -
s            | t        | PAST
2p           | tok      | you
y            | á        | it

csinálhattátok

you could do it
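The table suggests how a morphosyntactic code could drive translation: each position contributes an English element, and a reordering rule puts the pieces into English order. The sketch below is a toy version under several assumptions: the code layout follows the ten-position verb codes shown earlier, the pronoun table and the reordering rule (subject, auxiliary, verb, object) are invented for this illustration, and non-modal past is rendered with "did" to avoid English inflection.

```python
PRONOUNS = {("1", "s"): "I", ("2", "s"): "you", ("3", "s"): "he/she",
            ("1", "p"): "we", ("2", "p"): "you", ("3", "p"): "they"}

def translate(lemma_gloss, msd):
    """Build an English string from a lemma gloss and a 10-position verb code."""
    modal = msd[1] == "o"            # subtype 'o': possibility (hat -> can)
    past = msd[3] == "s"             # tense 's': past
    subject = PRONOUNS[(msd[4], msd[5])]
    obj = "it" if msd[9] == "y" else ""   # 'y': definite object marked on the verb
    if modal:
        aux = "could" if past else "can"
    else:
        aux = "did" if past else ""
    # Reordering rule: subject, auxiliary, verb, object.
    return " ".join(w for w in (subject, aux, lemma_gloss, obj) if w)

english = translate("do", "Vois2p---y")
```

For the example code this yields the translation given on the slide; other codes produce only rough glosses.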


An example

hat
 |
csinál
/ | \
t á tok

can
 |
do
/ | \
d Ø you

could
/ \
you do


Syntax vs. case suffix

Pseudo-subject                                      | Extra rules; PRED, OBJ difficult for humans
Pseudo-object                                       | List of adverbs with accusative ending
Pseudo-dative                                       | List of verbs with dative subject
Unknown words (lemmas+suffixes)                     | Guessing (rules)
Information extraction: thematic/semantic relations | Proper morphosyntactic codes + rules
Uncertainty detection                               | Proper morphosyntactic codes
Machine translation (morpheme-based)                | Proper morphosyntactic codes


Summary

• Syntax-morphology interface in Hungarian

• Morphological coding systems

• Syntactic annotation in Hungarian corpora

• Morphosyntactic problems:

– NER

– IE

– MT


References

É. Kiss K., Kiefer F., Siptár P. 1999: Új magyar nyelvtan. Osiris Kiadó, Budapest.

Farkas Richárd, Szeredi Dániel, Varga Dániel, Vincze Veronika 2010: MSD-KR harmonizáció a Szeged Treebank 2.5-ben. In: Tanács Attila, Vincze Veronika (eds.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 349-353.

Farkas, Richárd; Vincze, Veronika; Nagy, István; Ormándi, Róbert; Szarvas, György; Almási, Attila 2008: Web-based lemmatisation of Named Entities. In: Horák, Ales; Kopeček, Ivan; Pala, Karel; Sojka, Petr (eds.): Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD2008). Berlin, Heidelberg, Springer Verlag, LNCS 5246, pp. 53-60.

Koutny I., Wacha B. 1991: Magyar nyelvtan függőségi alapon. Magyar Nyelv 87(4), pp. 393-404.

Mel’cuk, Igor 2006: Aspects of the Theory of Morphology. Mouton de Gruyter.

Prószéky G., Koutny I., Wacha B. 1989: Dependency Syntax of Hungarian. In: Maxwell, Dan; Schubert, Klaus (eds.): Metataxis in Practice (Dependency Syntax for Multilingual Machine Translation). Foris, Dordrecht, The Netherlands, pp. 151-181.

Zsibrita János, Vincze Veronika, Farkas Richárd 2010: Ismeretlen kifejezések és a szófaji egyértelműsítés. In: Tanács Attila, Vincze Veronika (eds.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 275-283.