Application of INTEX in refinement and validation of Serbian WordNet
Ivan Obradović, Ranka Stanković, Cvetana Krstev, Gordana Pavlović-Lažetić
University of Belgrade
WordNet (WN)
a semantic network of concepts represented by synsets – sets of synonymous words (nouns, verbs, adjectives and adverbs)
contains explicitly coded descriptions of semantic relations
inspired by research in the field of psycholinguistics
initially developed at Princeton for the English language
Fellbaum C. (ed.), (1998) WordNet: An Electronic Lexical Database, The MIT Press
Multilingual WordNets
Featuring: the InterLingual Index (ILI)
EuroWordNet (EWN): Dutch, Italian, Spanish, German, French, Czech and Estonian
BalkaNet (BWN): five Balkan languages (Greek, Turkish, Bulgarian, Romanian and Serbian), as well as Czech
Vossen, P. (ed.) (1998) EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Academic Publishers, Dordrecht
Stamou S., Oflazer K., Pala K., Christoudoulakis D., Cristea D., Tufis D., Koeva S., Totkov G., Dutoit D., Grigoriadou M. (2002) BALKANET: A Multilingual Semantic Network for Balkan Languages, 1st International Wordnet Conference, Mysore, India, January 2002 (http://www.ceid.upatras.gr/Balkanet/files/balkanet-elsnet-ko-accept.pdf)
The WN semantic network
based on a grouping of synonyms into synsets, which represent the network nodes
nodes are interconnected by arcs which describe particular semantic relations (hyperonymy, hyponymy, antonymy etc.)
in general, every synset is accompanied by a definition (gloss) and examples of usage that specify the meaning of the concept represented by the synset
the semantic network itself is an XML document with a precisely established set of entities
The Serbian version of WN
developed starting from the base concepts of the English WN using existing English/Serbian dictionaries in paper form
synset elements represented as the elements in DELAS or DELAC dictionaries without any additional morphosyntactic information
lexical meanings in Serbian coded with reference to the dictionary of Matica Srpska
XML representation of a synset in Serbian WN
(demonstrate, establish, prove, show)
<SYNSET>
  <ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL> dokazati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> dokazivati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> pokazati <SENSE> 3 </SENSE> </LITERAL>
    <LITERAL> pokazivati <SENSE> 3 </SENSE> </LITERAL>
  </SYNONYM>
  <DEF> Utvrditi valxanost necyega, primerom, objasxnxenxem ili eksperimentom. (Establish the validity of something by example, explanation or experiment)</DEF>
  <USAGE> Anketa je pokazala da u tako nesxto veruje mali broj ispitanih. (The poll showed that few people believe in this) </USAGE>
  <POS>v</POS>
  <ILR>ENG171-00529622-v <TYPE>hypernym</TYPE></ILR>
  <BCS>1</BCS>
  <STAMP>Dusko 2003/04/21</STAMP>
</SYNSET>
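As an illustration of how such a synset can be read programmatically, here is a minimal Python sketch using the standard-library `xml.etree.ElementTree` module. The function name `read_synset` and the shape of the returned dictionary are assumptions made for this example, not part of the Serbian WN tools; the XML is a shortened copy of the synset above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the synset shown above (DEF, USAGE, BCS and STAMP omitted).
SYNSET_XML = """
<SYNSET>
  <ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL> dokazati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> dokazivati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> pokazati <SENSE> 3 </SENSE> </LITERAL>
    <LITERAL> pokazivati <SENSE> 3 </SENSE> </LITERAL>
  </SYNONYM>
  <POS>v</POS>
  <ILR>ENG171-00529622-v <TYPE>hypernym</TYPE></ILR>
</SYNSET>
"""


def read_synset(xml_text):
    """Return the ID, part of speech, literals and relations of one synset."""
    root = ET.fromstring(xml_text.strip())
    literals = []
    for literal in root.iter("LITERAL"):
        # The word is the text that precedes the nested <SENSE> element.
        word = (literal.text or "").strip()
        sense = int(literal.findtext("SENSE").strip())
        literals.append((word, sense))
    relations = []
    for ilr in root.iter("ILR"):
        # The target synset ID precedes the nested <TYPE> element.
        target = (ilr.text or "").strip()
        relations.append((target, ilr.findtext("TYPE").strip()))
    return {
        "id": root.findtext("ID").strip(),
        "pos": root.findtext("POS").strip(),
        "literals": literals,
        "relations": relations,
    }


synset = read_synset(SYNSET_XML)
print(synset["id"], synset["literals"])
```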
Problems in Serbian WN that might be solved using INTEX
lack of morphological and syntactic information related to lexemes
absence of precise criteria for the selection of lexemes for a particular synset
lack of information on relative relevance of each lexeme in a synset in terms of its lexical frequency
Incorporation of morphosyntactic information into synsets using INTEX
The DictWNSrp program
matches literals in WN with literals in selected DELAS dictionaries and extracts morphosyntactic information from dictionaries
assigns morphosyntactic information to WN literals in cases of a 1-1 match
offers the user the option to confirm or alter the assigned information and resolve cases of homography (e.g. multiple matches)
transfers confirmed morphosyntactic information into the WN using the LNOTE element
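The matching step can be sketched in Python as follows. This is not the DictWNSrp implementation: the function names `load_delas` and `assign_codes` are hypothetical, and the dictionary lines are assumed to have a simple `lemma,code` shape. The two verb codes are the LNOTE values used in this presentation; the entry `homograf` and its codes `CODE_A` and `CODE_B` are invented placeholders that only serve to show a multiple match.

```python
from collections import defaultdict

# Assumed "lemma,code" lines. The "homograf" entries are invented placeholders.
DELAS_LINES = [
    "dokazati,V122+Perf+Tr+Iref+Ref",
    "dokazivati,V18+Imperf+Tr+Iref",
    "homograf,CODE_A",
    "homograf,CODE_B",
]


def load_delas(lines):
    """Index dictionary lines by lemma; a lemma may carry several codes."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        lemma, _, code = line.partition(",")
        index[lemma].append(code)
    return index


def assign_codes(literals, index):
    """Split literals into 1-1 matches, multiple matches and non-matches."""
    assigned, ambiguous, unmatched = {}, {}, []
    for literal in literals:
        codes = index.get(literal, [])
        if len(codes) == 1:
            assigned[literal] = codes[0]       # 1-1 match: assign directly
        elif codes:
            ambiguous[literal] = list(codes)   # homography: left to the user
        else:
            unmatched.append(literal)          # not found in the dictionary
    return assigned, ambiguous, unmatched


index = load_delas(DELAS_LINES)
assigned, ambiguous, unmatched = assign_codes(
    ["dokazati", "dokazivati", "homograf", "pokazati"], index
)
print(assigned, ambiguous, unmatched)
```

Keeping the three outcomes separate mirrors the workflow described above: only the 1-1 matches are assigned automatically, while the ambiguous ones wait for the user's decision.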
Resolving homography with the DictWNSrp program
XML representation of a synset with assigned morphosyntactic information
<SYNONYM>
  <LITERAL>dokazati <SENSE>1</SENSE> <LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE> </LITERAL>
  <LITERAL>dokazivati <SENSE>1</SENSE> <LNOTE>V18+Imperf+Tr+Iref</LNOTE> </LITERAL>
  <LITERAL>pokazati <SENSE>3</SENSE> <LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE> </LITERAL>
  <LITERAL>pokazivati <SENSE>3</SENSE> <LNOTE>V18+Imperf+Tr+Iref</LNOTE> </LITERAL>
</SYNONYM>
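A rough Python sketch of how confirmed codes could be written into LNOTE elements is given below, using the standard-library `xml.etree.ElementTree` module. The helper name `add_lnotes` and the `confirmed` mapping are assumptions for illustration; the actual tool may transfer the information differently.

```python
import xml.etree.ElementTree as ET


def add_lnotes(synonym, confirmed):
    """Append an LNOTE child to each LITERAL that has a confirmed code."""
    for literal in synonym.findall("LITERAL"):
        word = (literal.text or "").strip()
        code = confirmed.get(word)
        # Skip literals without a confirmed code or with an existing LNOTE.
        if code is None or literal.find("LNOTE") is not None:
            continue
        note = ET.SubElement(literal, "LNOTE")
        note.text = code
    return synonym


synonym = ET.fromstring(
    "<SYNONYM>"
    "<LITERAL>dokazati<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>dokazivati<SENSE>1</SENSE></LITERAL>"
    "</SYNONYM>"
)
# Only "dokazati" is confirmed here, so "dokazivati" stays unchanged.
add_lnotes(synonym, {"dokazati": "V122+Perf+Tr+Iref+Ref"})
print(ET.tostring(synonym, encoding="unicode"))
```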
Validation of lexemes from a synset on a corpus
Phase One: The IntexWN program
  selects and displays all synsets from WN for a given lexeme
  constructs INTEX graphs for all lexemes from selected synsets
Phase Two: INTEX
  produces concordances from a chosen corpus for graphs constructed by IntexWN
Phase Three: User
  checks the validity of synonymous relations of lexemes on concordances
  decides on removing or adding new lexemes to the synset
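To give a feel for what Phase Two delivers to the user, the following Python sketch builds a simple keyword-in-context concordance. It does not use INTEX or its graphs: the function `concordance` and the hand-written `forms` mapping from inflected form to lemma are assumptions that stand in for the dictionary-based recognition INTEX performs. The sample sentence is the usage example from the synset shown earlier.

```python
import re


def concordance(text, forms, width=3):
    """Return (lemma, left context, matched form, right context) tuples."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        lemma = forms.get(token)
        if lemma is None:
            continue
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        hits.append((lemma, left, token, right))
    return hits


# Hypothetical mapping of inflected forms to synset lexemes.
forms = {"pokazala": "pokazati"}
text = "Anketa je pokazala da u tako nesxto veruje mali broj ispitanih."

for lemma, left, form, right in concordance(text, forms):
    print(f"{lemma}: {left} [{form}] {right}")
```

In Phase Three, each such line would be inspected by the user to decide whether the lexeme is used in the sense of the synset.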
Constructing a graph for all lexemes from a synset with the IntexWN program
Validation results for synset ENG171-11771798 (being, beingness, existence)
Lexeme        Occurrences (1)  Frequency (2)  Validated occurrences (3)  Frequency (4)  Ratio (3/1)
bivstvo       0                0.00           0                          0.00           n/a
egzistencija  7                0.07           1                          0.04           0.14
zxivot        72               0.77           7                          0.30           0.10
postojanxe    15               0.16           15                         0.65           1.00
Total         94                              23                                        0.24
Comments:
the lexemes used in the synset have been used to denote the given concept in 24% of concordances
the lexeme most frequently used to denote the given concept is postojanxe
although zxivot is the most frequent lexeme in the synset, it has been used to denote the given concept only in 10% of cases
bivstvo does not occur in the corpus and its exclusion from the synset could be considered if a similar result is obtained on a wider corpus
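The figures in the table follow directly from the raw counts, as the short Python sketch below shows. The function name `validation_table` and the layout of the result are assumptions for this example; the counts are those reported above (occurrences and validated occurrences per lexeme).

```python
# (occurrences, validated occurrences) per lexeme, taken from the table.
COUNTS = {
    "bivstvo": (0, 0),
    "egzistencija": (7, 1),
    "zxivot": (72, 7),
    "postojanxe": (15, 15),
}


def validation_table(counts):
    """Compute columns (2), (4) and the ratio (3/1) from raw counts."""
    total_occ = sum(occ for occ, _ in counts.values())
    total_val = sum(val for _, val in counts.values())
    rows = {}
    for lexeme, (occ, val) in counts.items():
        rows[lexeme] = {
            # Share of all occurrences in the corpus, column (2).
            "freq_occ": occ / total_occ if total_occ else None,
            # Share of all validated occurrences, column (4).
            "freq_val": val / total_val if total_val else None,
            # Validated share of this lexeme's own occurrences, (3/1).
            "ratio": val / occ if occ else None,
        }
    overall = total_val / total_occ if total_occ else None
    return rows, total_occ, total_val, overall


rows, total_occ, total_val, overall = validation_table(COUNTS)
for lexeme, row in rows.items():
    print(lexeme, row)
print("Total", total_occ, total_val, overall)
```

A lexeme with no occurrences, such as bivstvo, gets no ratio, which corresponds to the n/a entry in the table.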
Further developments
definition of more precise criteria for validation of lexemes in a synset based on their occurrence in corpora
investigation of possibilities for introducing relevance information in synsets
further development of the IntexWN program to include semantic relations, such as hyponymy/hyperonymy etc.
introduction of near-synonym information into the Serbian WN using INTEX dictionaries (e.g. augmentatives/diminutives)
investigation of possibilities for introducing multilingual features into INTEX using the WN (to be used for parallel corpora)