
Application of INTEX in refinement and validation of Serbian WordNet

Ivan Obradović, Ranka Stanković, Cvetana Krstev, Gordana Pavlović-Lažetić

University of Belgrade

WordNet (WN)

a semantic network of concepts represented by synsets: sets of synonymous words (nouns, verbs, adjectives and adverbs)

contains explicitly coded descriptions of semantic relations

inspired by research in the field of psycholinguistics

initially developed at Princeton for the English language

Fellbaum C. (ed.), (1998) WordNet: An Electronic Lexical Database, The MIT Press

Multilingual WordNets

Featuring: the InterLingual Index (ILI)

EuroWordNet (EWN): Dutch, Italian, Spanish, German, French, Czech and Estonian

BalkaNet (BWN): five Balkan languages (Greek, Turkish, Bulgarian, Romanian and Serbian), as well as Czech

Vossen, P. (ed.) (1998) EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Academic Publishers, Dordrecht

Stamou S., Oflazer K., Pala K., Christoudoulakis D., Cristea D., Tufis D., Koeva S., Totkov G., Dutoit D., Grigoriadou M. (2002) BALKANET: A Multilingual Semantic Network for Balkan Languages, 1st International Wordnet Conference, Mysore, India, January 2002 (http://www.ceid.upatras.gr/Balkanet/files/balkanet-elsnet-ko-accept.pdf)

The WN semantic network

based on a grouping of synonyms into synsets, which represent the network nodes

nodes are interconnected by arcs which describe particular semantic relations (hyperonymy, hyponymy, antonymy etc.)

in general, every synset is accompanied by a definition (gloss) and examples of usage that specify the meaning of the concept represented by the synset

the semantic network itself is an XML document with a precisely established set of entities

The Serbian version of WN

developed starting from the base concepts of the English WN, using existing English/Serbian dictionaries in paper form

synset elements are represented in the same way as entries in the DELAS or DELAC dictionaries, but without any additional morphosyntactic information

lexical meanings in Serbian are coded with reference to the dictionary of Matica Srpska

XML representation of a synset in Serbian WN

(demonstrate, establish, prove, show)

<SYNSET>
  <ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL>dokazati<SENSE>1</SENSE></LITERAL>
    <LITERAL>dokazivati<SENSE>1</SENSE></LITERAL>
    <LITERAL>pokazati<SENSE>3</SENSE></LITERAL>
    <LITERAL>pokazivati<SENSE>3</SENSE></LITERAL>
  </SYNONYM>
  <DEF>Utvrditi valxanost necyega, primerom, objasxnxenxem ili eksperimentom. (Establish the validity of something by example, explanation or experiment)</DEF>
  <USAGE>Anketa je pokazala da u tako nesxto veruje mali broj ispitanih. (The poll showed that few people believe in this)</USAGE>
  <POS>v</POS>
  <ILR>ENG171-00529622-v<TYPE>hypernym</TYPE></ILR>
  <BCS>1</BCS>
  <STAMP>Dusko 2003/04/21</STAMP>
</SYNSET>
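As a minimal sketch of how such a synset could be read programmatically, the Python fragment below parses a shortened copy of the entry above with the standard `xml.etree.ElementTree` module. The function name `parse_synset` and the shape of the returned dictionary are assumptions made for illustration; they are not part of the Serbian WN tools.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the synset shown above (DEF, USAGE and STAMP omitted).
SYNSET_XML = """
<SYNSET>
  <ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL> dokazati <SENSE> 1 </SENSE></LITERAL>
    <LITERAL> dokazivati <SENSE> 1 </SENSE></LITERAL>
    <LITERAL> pokazati <SENSE> 3 </SENSE></LITERAL>
    <LITERAL> pokazivati <SENSE> 3 </SENSE></LITERAL>
  </SYNONYM>
  <POS>v</POS>
  <ILR>ENG171-00529622-v<TYPE>hypernym</TYPE></ILR>
  <BCS>1</BCS>
</SYNSET>
"""


def parse_synset(xml_text):
    """Return the ID, part of speech, literals and relations of one synset.

    Hypothetical helper: the dictionary layout is an illustrative choice.
    """
    root = ET.fromstring(xml_text)
    literals = []
    for lit in root.find("SYNONYM").findall("LITERAL"):
        # The literal is the text that precedes the nested SENSE element.
        word = (lit.text or "").strip()
        sense = int(lit.findtext("SENSE").strip())
        literals.append((word, sense))
    relations = [
        ((ilr.text or "").strip(), ilr.findtext("TYPE").strip())
        for ilr in root.findall("ILR")
    ]
    return {
        "id": root.findtext("ID").strip(),
        "pos": root.findtext("POS").strip(),
        "literals": literals,
        "relations": relations,
    }


synset = parse_synset(SYNSET_XML)
print(synset["id"], synset["literals"], synset["relations"])
```

Whitespace around literals and sense numbers is stripped, since the entries in the database may be padded with spaces.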

Problems in Serbian WN that might be solved using INTEX

lack of morphological and syntactic information related to lexemes

absence of precise criteria for the selection of lexemes for a particular synset

lack of information on relative relevance of each lexeme in a synset in terms of its lexical frequency

Incorporation of morphosyntactic information into synsets using INTEX

The DictWNSrp program

matches literals in WN with literals in selected DELAS dictionaries and extracts morphosyntactic information from the dictionaries

assigns morphosyntactic information to WN literals in cases of a 1-1 match

offers the user the option to confirm or alter the assigned information and to resolve cases of homography (i.e. multiple matches)

transfers confirmed morphosyntactic information into the WN using the LNOTE element
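The matching step described above can be sketched roughly as follows. This is not the DictWNSrp implementation, only an illustration under stated assumptions: dictionary entries are taken to be lines of the form `lemma,CODE+feature+...`, the function names `load_delas` and `assign_lnotes` are hypothetical, and the two entries for the homograph `kosa` use invented codes (`N1`, `N2`) purely to show the ambiguous case.

```python
from collections import defaultdict


def load_delas(lines):
    """Index dictionary lines of the assumed form 'lemma,CODE+features' by lemma."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blank or malformed lines
        lemma, info = line.split(",", 1)
        index[lemma].append(info)
    return index


def assign_lnotes(literals, delas_index):
    """Split literals into 1-1 matches, homographs and unmatched literals."""
    assigned, ambiguous, unmatched = {}, {}, []
    for literal in literals:
        candidates = delas_index.get(literal, [])
        if len(candidates) == 1:
            assigned[literal] = candidates[0]        # 1-1 match: assign directly
        elif candidates:
            ambiguous[literal] = list(candidates)    # homography: left to the user
        else:
            unmatched.append(literal)                # not found in the dictionary
    return assigned, ambiguous, unmatched


# Sample entries: the verb codes come from the slides, the 'kosa' codes are invented.
delas = load_delas([
    "dokazati,V122+Perf+Tr+Iref+Ref",
    "dokazivati,V18+Imperf+Tr+Iref",
    "kosa,N1",
    "kosa,N2",
])
assigned, ambiguous, unmatched = assign_lnotes(
    ["dokazati", "dokazivati", "kosa", "bivstvo"], delas
)
print(assigned)
print(ambiguous)
print(unmatched)
```

Only the unambiguous matches are assigned automatically; the ambiguous ones are returned separately because, as noted above, the user decides which reading applies.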

Resolving homography with the DictWNSrp program

XML representation of a synset with assigned morphosyntactic information

<SYNONYM>
  <LITERAL>dokazati<SENSE>1</SENSE><LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE></LITERAL>
  <LITERAL>dokazivati<SENSE>1</SENSE><LNOTE>V18+Imperf+Tr+Iref</LNOTE></LITERAL>
  <LITERAL>pokazati<SENSE>3</SENSE><LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE></LITERAL>
  <LITERAL>pokazivati<SENSE>3</SENSE><LNOTE>V18+Imperf+Tr+Iref</LNOTE></LITERAL>
</SYNONYM>
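One way to produce this kind of output is sketched below: confirmed morphosyntactic codes are appended to each LITERAL as an LNOTE child element. The helper name `add_lnotes` and its dictionary argument are assumptions for illustration and do not describe how DictWNSrp itself writes the data.

```python
import xml.etree.ElementTree as ET


def add_lnotes(synonym_xml, lnotes):
    """Append an LNOTE element to every LITERAL that has a confirmed code.

    `lnotes` maps a literal to its code string; literals that already
    carry an LNOTE are left untouched. Hypothetical helper.
    """
    root = ET.fromstring(synonym_xml)
    for lit in root.findall("LITERAL"):
        word = (lit.text or "").strip()
        if word in lnotes and lit.find("LNOTE") is None:
            ET.SubElement(lit, "LNOTE").text = lnotes[word]
    return ET.tostring(root, encoding="unicode")


synonym = (
    "<SYNONYM>"
    "<LITERAL>dokazati<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>dokazivati<SENSE>1</SENSE></LITERAL>"
    "</SYNONYM>"
)
annotated = add_lnotes(synonym, {
    "dokazati": "V122+Perf+Tr+Iref+Ref",
    "dokazivati": "V18+Imperf+Tr+Iref",
})
print(annotated)
```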

Validation of lexemes from a synset on a corpus

Phase One: the IntexWN program selects and displays all synsets from WN for a given lexeme, and constructs INTEX graphs for all lexemes from the selected synsets

Phase Two: INTEX produces concordances from a chosen corpus for the graphs constructed by IntexWN

Phase Three: the user checks the validity of the synonymy relations between lexemes on the concordances, and decides on removing lexemes from, or adding new lexemes to, the synset

Constructing a graph for all lexemes from a synset with the IntexWN program

Validation results for synset ENG171-11771798 (being, beingness, existence)

Lexeme        Occurrences (1)  Frequency (2)  Validated occurrences (3)  Frequency (4)  Ratio (3/1)
bivstvo       0                0,00           0                          0,00           n/a
egzistencija  7                0,07           1                          0,04           0,14
zxivot        72               0,77           7                          0,30           0,10
postojanxe    15               0,16           15                         0,65           1,00
Total         94                              23                                        0,24

Comments:

the lexemes used in the synset have been used to denote the given concept in 24% of concordances

the lexeme most frequently used to denote the given concept is postojanxe

although zxivot is the most frequent lexeme in the synset, it has been used to denote the given concept only in 10% of cases

bivstvo does not occur in the corpus and its exclusion from the synset could be considered if a similar result is obtained on a wider corpus
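The figures in the table can be reproduced from the raw counts alone. The sketch below recomputes both frequency columns and the ratio of validated to total occurrences; the function name `validation_table` and the dictionary layout are illustrative assumptions, while the counts are those reported above.

```python
# (occurrences, validated occurrences) per lexeme, as reported in the table.
counts = {
    "bivstvo": (0, 0),
    "egzistencija": (7, 1),
    "zxivot": (72, 7),
    "postojanxe": (15, 15),
}


def validation_table(counts):
    """Compute frequency (2), frequency (4) and ratio (3/1) for each lexeme."""
    total_occ = sum(occ for occ, _ in counts.values())
    total_val = sum(val for _, val in counts.values())
    rows = {}
    for lexeme, (occ, val) in counts.items():
        rows[lexeme] = {
            # share of all occurrences in the corpus
            "freq": occ / total_occ if total_occ else None,
            # share of all validated occurrences
            "validated_freq": val / total_val if total_val else None,
            # ratio is undefined (n/a) when the lexeme never occurs
            "ratio": val / occ if occ else None,
        }
    overall = total_val / total_occ if total_occ else None
    return rows, total_occ, total_val, overall


rows, total_occ, total_val, overall = validation_table(counts)
for lexeme, row in rows.items():
    print(lexeme, row)
print("Total", total_occ, total_val, overall)
```

A lexeme with zero occurrences, such as bivstvo, gets `None` as its ratio, which corresponds to the n/a entry in the table.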

Further developments

definition of more precise criteria for validation of lexemes in a synset based on their occurrence in corpora

investigation of possibilities for introducing relevance information in synsets

further development of the IntexWN program to include semantic relations, such as hyponymy/hyperonymy etc.

introduction of near-synonym information into the Serbian WN using INTEX dictionaries (e.g. augmentatives/diminutives)

investigation of possibilities for introducing multilingual features into INTEX using the WN (to be used for parallel corpora)
