Comparative Structures in Croatian: MWU Approach
Kristina Kocijan, Sara Librenjak
Department of Information and Communication Sciences, University of Zagreb
Europhras 2015, Malaga, Spain, 2015-07-01

Page 2

Language of our work - Croatian

• South Slavic language
• High similarity to Bosnian, Serbian and Montenegrin
• Latin alphabet
• Properties:
  o Highly inflected (7 cases)
  o Syntactically flexible (almost any word order possible)
  o Pronoun dropping
• A challenge for computational processing

Page 3

Computational approach to idioms

• Comparative structures as a subtype of idiomatic structures
• Two manners of computational language processing
  o Statistical approach
  o Rule-based approach
• Idioms
  o Highly specific part of language (i.e. replacing one word changes the whole meaning)
  o Statistical approach would yield imprecise results
  o Rule-based approach preferable, especially when dealing with inflected languages

Page 4

Importance of idioms in computational processing of texts

• Present in language, yet often ignored
• Difficult to process – described only linguistically
• Causing incomplete computational understanding of the language and imprecise translation
• Lack of real data about their frequency
• Why are they difficult to process?
  o Because of their multi-word nature
  o Because of their elusive semantic properties (meaning is not the sum of the words)
  o Because of their cultural and historical nuances, which render them very difficult to translate without special preparation

Page 5

Croatian phraseology and comparisons

• Well described linguistically (Croatian Dictionary of Idioms with ~2500 entries)
  o Lack of systematic approach essential for text processing
  o Sorted into categories for the purposes of this work
• Comparative structures as one of the main categories of idioms
  o Radi kao pčela (Works as hard as a bee)
  o Puši kao Turčin (Smokes like a pipe, lit. like a Turk)
  o Brz poput strijele (Fast as an arrow)
• Approximately 540 set comparative phrases in Croatian (Fink-Arsovski)
Page 6

Comparisons in literature and beyond

• Comparative structures (usporedbe or poredbe) are mainly a feature of literary texts and newspapers
  o Filaković (2008) assumes their presence in works of fiction by analyzing the works of Croatian writer I. B. Mažuranić
  o Kovačević (2012) reports linguistic creativity in the use of comparative structures in newspaper articles
  o Mance and Trtanj (2010) note the usage of modern slang variants of the comparisons
• No statistical data about their real usage in various types of text

Page 7

Goals of this work

• To build a tool for automated processing of the comparative idioms in Croatian texts
• To be able to recognize them in any type of text as a multi-word unit
  o Extract, describe and enumerate the structures
  o Collect the statistical data about their frequency in different styles of texts
  o Serve as an example for similar work in other languages
  o Be used as a tool in automated or semi-automated machine translation of Croatian to any language (provided additional work is done)

Page 8

NooJ – a tool for rule-based automated text processing

• NooJ – free-to-use linguistic development environment for various kinds of rule-based automated text and corpora processing
• http://nooj4nlp.net/
• Morphological, syntactic and semantic processing with options for translation and transformation of sentences
• Ready-made resources for about two dozen languages:
  o Acadian, Arabic, Armenian, Belarusian, Bulgarian, Catalan, Croatian, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Polish, Portuguese, Russian, Serbian, Slovene, Spanish, Turkish, Vietnamese
• Great tool for highly inflected languages

Page 9

Methodology

1. Listing and categorizing the idioms
2. Definition and recognition of rules
3. Construction of training and testing corpora
4. Construction of grammars for processing texts
   o Using NooJ as a platform
5. Testing phase
6. Calculation of results

Page 10

Listing and categorizing the idioms

• Based on the Croatian Dictionary of Idioms and idioms manually found in a Croatian corpus
• For the purposes of the computational approach, we defined five major categories
  a) Noun phrase with an attribute or apposition
  b) Verbal phrase with a direct object
  c) Verbal phrase with an optional direct object which can disrupt the syntactic structure
  d) Comparative structure (A/V as N)
  e) Fixed phrase which doesn't change in any syntactic environment

Page 11

Definition and recognition of rules

• 312 different comparative constructions in our dictionary
  o Recognized in any form, tense, case and word order
• Divided into 5 subcategories according to syntactic properties
  1. Adjective AS Noun = 89
  2. Noun AS Preposition = 9
  3. AS a Noun/Adjective = 49
     1. AS a Noun (7)
     2. AS a PP fixed phrase (37)
     3. AS a N + PP (5)
  4. Verb AS Noun = 157
  5. AS IF Verb = 8
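The subcategory counts can be cross-checked with a little arithmetic. The sketch below is only an illustration: the dictionary and variable names are hypothetical, and the numbers are copied from the list on this slide.

```python
# Counts copied from the slide; all names here are illustrative, not from the authors.
subcategories = {
    "Adjective AS Noun": 89,
    "Noun AS Preposition": 9,
    "AS a Noun/Adjective": 49,
    "Verb AS Noun": 157,
    "AS IF Verb": 8,
}

# Breakdown of the third subcategory, also from the slide.
as_a_noun_adjective = {
    "AS a Noun": 7,
    "AS a PP fixed phrase": 37,
    "AS a N + PP": 5,
}

total = sum(subcategories.values())
nested_total = sum(as_a_noun_adjective.values())

print(total)         # 312, the dictionary size stated on the slide
print(nested_total)  # 49, the size of the "AS a Noun/Adjective" subcategory
```

Both sums agree with the totals reported on the slide.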

Page 12

Construction of training and testing corpora

• First phase: training
  o A smaller corpus of sentences exclusively containing the structures in question (comparative structures with the words „kao” or „poput”)
• Second phase: testing
  o After the completion of the grammars (NooJ files for processing texts), results are tested on the bigger corpus
  o Corpus 1: random texts from the Web corpus in different styles of text (2.2 million word corpus)
  o Corpus 2: literary texts of mostly Croatian authors (658 Kw corpus)

Page 13

Construction of grammars for processing texts

• Grammar – a file constructed in the NooJ environment, made for syntactic processing of texts
• Input, output, variables, nested grammars
• Concordance with marked texts as an output

Page 14

Adjective AS Noun

Recognizes:
Lijep kao slika (pretty as a picture)
Pijan kao smuk (drunk as a sponge)
Brz kao zec (fast as a bullet)
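To make the "Adjective AS Noun" pattern concrete, here is a rough Python sketch. It is not the authors' NooJ grammar (those are built inside NooJ and rely on its morphological dictionaries); it is a simplified stand-in that assumes a hypothetical mini-lexicon of stems and uses crude prefix matching to tolerate inflected forms.

```python
import re

# Hypothetical mini-lexicon of (adjective stem, noun stem) pairs,
# built from the three examples on this slide.
LEXICON = [("lijep", "slik"), ("pijan", "smuk"), ("brz", "zec")]

# A "word" is a run of letters, including Croatian diacritics.
WORD = r"[^\W\d_]+"
PATTERN = re.compile(rf"\b({WORD})\s+(kao|poput)\s+({WORD})\b", re.IGNORECASE)

def find_adjective_as_noun(text):
    """Return substrings shaped like <adjective> kao/poput <noun> whose
    adjective and noun both start with a stem pair from LEXICON."""
    hits = []
    for match in PATTERN.finditer(text):
        adjective = match.group(1).lower()
        noun = match.group(3).lower()
        # Prefix matching is a crude substitute for real morphological analysis.
        if any(adjective.startswith(a) and noun.startswith(n) for a, n in LEXICON):
            hits.append(match.group(0))
    return hits

print(find_adjective_as_noun("Bila je lijepa kao slika, a on pijan kao smuk."))
# ['lijepa kao slika', 'pijan kao smuk']
```

Prefix matching will overgenerate on a real corpus and cannot handle changed word order, which is one reason a dictionary-backed rule system is a better fit for an inflected language.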

Page 15

Noun AS Preposition

Recognizes:
Mrak kao u rogu (pitch dark)

AS a Noun

Recognizes:
Kao drvena Marija (being stiff, unrelaxed)
Poput guske u magli (without thinking)

Page 16

Verb AS Noun

Recognizes:
Ići kao po loju (go smoothly, slide like over the fat)
Šutjeti kao grob (be silent as a grave)

AS IF Verb

Recognizes:
Kao da je u zemlju propao (as if the Earth swallowed him)
Kao da je pao s Marsa (clueless, as if he came from Mars)

Page 17

Example of results

[Figure: example output; label: Comparative structure]

Page 18

Evaluation

Corpus              Kilo-words (Kw)   Structures found   Precision   Recall   F-measure
Training corpus                                          100%        96%      98%
Corpus 1 (web)      2247              22
Corpus 2 (books)    658               67
Average
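The slide does not say which F-measure was used. Assuming the balanced F1 score (the harmonic mean of precision and recall), the training-corpus figures are consistent with each other, as this minimal sketch shows. The function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F1 score: harmonic mean of precision and recall (an assumption here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Training-corpus values from the evaluation table.
f1 = f_measure(1.00, 0.96)
print(f"{f1:.4f}")  # 0.9796, which rounds to the reported 98%
```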

Page 19

Conclusions about comparison in Croatian

• Number of comparative structures in different types of texts varies greatly
  o General texts (web corpus) – 1 per every 10000 words
  o Literary texts (books by Croatian authors) – 1 per every 1000 words
• Confirmed the hypothesis that such structures pertain mostly to the literary style
  o 10 times more frequent in books and works of fiction
  o Rare in other styles of writing due to the stylistic marking they bring to the text
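The "10 times more frequent" claim can be checked against the counts in the evaluation table (22 structures in 2247 Kw for the web corpus, 67 in 658 Kw for the books corpus). This is a minimal sketch that assumes those table cells are the numbers of structures found; the function and variable names are illustrative.

```python
def density_per_kw(structures_found, kilo_words):
    """Structures found per thousand words of corpus."""
    return structures_found / kilo_words

# Values as they appear in the evaluation table.
web_density = density_per_kw(22, 2247)
books_density = density_per_kw(67, 658)

# How many times denser the books corpus is than the web corpus.
ratio = books_density / web_density
print(f"{ratio:.1f}")  # 10.4
```

The ratio comes out at roughly ten, in line with the stated conclusion.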

Page 20

Thank you for your attention.

Questions?