Integrating Semantic Dictionaries for English, French and Bulgarian into the NooJ System for the Purposes of Information Retrieval. Svetla Koeva, Max Silberztein. 8th INTEX / NooJ Workshop, 30 May, 2005

Page 1: Svetla Koeva, Max Silberztein, 8th INTEX / NooJ Workshop, 30 May, 2005

Integrating Semantic Dictionaries for English, French and Bulgarian

into the NooJ System for the Purposes of Information Retrieval

Svetla Koeva, Max Silberztein

8th INTEX / NooJ Workshop,

30 May, 2005

Page 2

Main research goals

• To provide a sufficient methodology for the implementation of natural language semantic relations in the NooJ system:

– to create specialized semantic dictionaries for English, French and Bulgarian based on WordNet semantic relations;

– to provide a complete formalization of the inflection of simple and compound words included in the WordNet structure.

Page 3

History

• The integration of semantic relations into the INTEX system was initially proposed at the sixth INTEX workshop.

• Later on, the idea was developed further in the joint research RILA project "Information retrieval based on semantic relations":

– LASELDI, Université de Franche-Comté;

– Department of Computational Linguistics, IBL, Bulgarian Academy of Sciences.

Page 4

Language resources

• Bulgarian grammatical dictionary (BGD) – over 83 000 lemmas and 1 100 000 word forms;

• English WordNet 2.0 – 115 424 synonym sets;

• Bulgarian WordNet (BalkaNet project) – 22 867 synonym sets;

• French WordNet (EuroWordNet project) – 33 512 synonym sets;

• English dictionary – over 30 000 lemmas (not inflected);

• French dictionary – extracted with INTEX.

Page 5

Implementation tasks

• To transform the format of the BGD into the NooJ standard;

• To create semantic dictionaries for Bulgarian and English;

• To associate lemmas from the Bulgarian semantic dictionaries with the corresponding inflection types;

• To add missing lemmas and inflection types in BGD, if any;

• To create extensive dictionaries and corresponding inflection types for compounds.

Page 6

BGD – Information structure design

• Category information – 6 classes: Noun, Verb, Adjective, Pronoun, Numeral, Others (Adverb, Preposition, Conjunction, Particle, Interjection);

• Paradigmatic information – Personal, Transitive, Perfective, Common, …;

• Grammatical information – Inflection, Conjugation, Sound alternations, ….

Page 7

BGD – Grammatical subclasses

• Nouns – 22 subclasses with respect to their Type (Common, Proper, Singularia tantum, Pluralia tantum) and Gender;

• Verbs – 32 subclasses with respect to Transitivity, Perfectiveness, and Personality;

• Adjectives – 2 subclasses;

• Pronouns – 26 subclasses with respect to their Type and Possessor;

• Numerals – 6 subclasses.

Page 8

BGD – Grammatical types

• Noun – Number, Definiteness, Counting form, Case, Optional forms – 266 types;

• Verb – Person, Number, Tense, Mood, Voice, Participles, Gender, Definiteness – 257 types;

• Adjective – Gender, Number, Definiteness – 30 types;

• Pronoun – Gender, Person, Number, Definiteness, Case, Clitic, Possessing – 28 types;

• Numeral – Gender, Number, Definiteness, Approximate form, Male form – 20 types.

Page 9

BGD – Dictionary format

Dictionary entries:

а,ЧА,0
абсол`ютен, ПРИ, 7
`август, С+М, 10
авиокомп`ания, С+Ж, 1
австр`ийски, ПРИ, 3
автоб`ус, С+М, 11
автомат`ичен, ПРИ, 7
адрес`ирам, Г+Н+Т, 4
агит`ирам, Г+Н+Т, 4

Inflection type ПРИ, 7:

sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'

Page 10

Transforming BGD

[Diagram: Dictionary and Grammatical types → Perl script (transliteration of labels)]

Page 11

NooJ dictionary

абсол`ютен, ПРИ, 7 → абсолютен,A+FLX=A-7

`август, С+М, 10 → август,N+M+FLX=N_M-10

авиокомп`ания, С+Ж, 1 → авиокомпания,N+F+FLX=N_F-1

австр`ийски, ПРИ, 3 → австрийски,A+FLX=A-3

автоб`ус, С+М, 11 → автобус,N+M+FLX=N_M-11

автомат`ичен, ПРИ, 7 → автоматичен,A+FLX=A-7

адрес`ирам, Г+Н+Т, 4 → адресирам,V+IT+FLX=V_IT-4
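The conversion shown on this slide can be sketched in a few lines. The original work used a Perl script that is not reproduced in the slides; the Python version below is only an illustration. The function name `bgd_to_nooj` and the `LABEL_MAP` table are hypothetical, and the table contains only the label pairs that can be read off the examples above.

```python
# Hypothetical label table: only the pairs visible in the slide examples.
# The real script's full transliteration table is not shown in the slides.
LABEL_MAP = {
    "ПРИ": "A",       # adjective
    "С+М": "N+M",     # masculine noun
    "С+Ж": "N+F",     # feminine noun
    "Г+Н+Т": "V+IT",  # verb subclass, as labelled on the slide
}


def bgd_to_nooj(line: str) -> str:
    """Convert one BGD entry 'lemma, labels, type' to a NooJ-style entry."""
    lemma, labels, infl_type = (field.strip() for field in line.split(","))
    lemma = lemma.replace("`", "")           # drop the stress mark
    category = LABEL_MAP[labels]             # transliterate the labels
    paradigm = f"{category.replace('+', '_')}-{infl_type}"
    return f"{lemma},{category}+FLX={paradigm}"


if __name__ == "__main__":
    for entry in ("абсол`ютен, ПРИ, 7", "`август, С+М, 10", "адрес`ирам,Г+Н+Т,4"):
        print(bgd_to_nooj(entry))
```

An entry whose label is missing from the table raises `KeyError`, which is a convenient way to spot labels that still need a mapping.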

Page 12

NooJ formal descriptions

sm0, Ok, ''          →  A-7 = <E>/sm0 +
smh, Ok, '2RCия'     →  <L2><S><R>ия<S1>/smh +
sml, Ok, '2RCият'    →  <L2><S><R>ият<S1>/sml +
sf0, Ok, '2RCа'      →  <L2><S><R>а<S1>/sf0 +
sfd, Ok, '2RCата'    →  <L2><S><R>ата<S1>/sfd +
sn0, Ok, '2RCо'      →  <L2><S><R>о<S1>/sn0 +
snd, Ok, '2RCото'    →  <L2><S><R>ото<S1>/snd +
p0, Ok, '2RCи'       →  <L2><S><R>и<S1>/p0 +
pd, Ok, '2RCите'     →  <L2><S><R>ите<S1>/pd;

Page 13

WordNet semantic relations

ILR                POS/POS          EW2.0     BulNet
HYPERONYMY         N/N V/V          94 844    15 838
NEAR ANTONYMY      N/N A/A V/V      7 642     1 847
PART MERONYMY      N/N              8 636     1 241
MEMBER MERONYMY    N/N              12 205    841
PORTION MERONYMY   N/N              787       107
SUBEVENT           V/V              409       162
CAUSES             V/V              439       104
SIMILAR TO         A/A V/V          22 196    1 479
VERB GROUP         V/V              1 748     848
ALSO SEE           A/A V/V          3 240     895

Page 14

Other relations

ILR                POS/POS              EW2.0     BulNet
BE IN STATE        A/N                  1 296     591
BG DERIVATIVE      N/V                  36 630    6 469
DERIVED            A/N                  6 809     1 071
PARTICIPLE         A/V                  401       56
REGION DOMAIN      N/N V/N A/N B/N      1 280     4
USAGE DOMAIN       N/N V/N A/N B/N      983       22
CATEGORY DOMAIN    N/N V/N A/N B/N      6 166     638

Page 15

Selected relations

• Synonymy (reflexive, symmetric, and transitive relation of equivalence);

• Hypernymy (inverse, asymmetric, and transitive relation between synonym sets);

• Meronymy (inverse, asymmetric, and transitive relation between synonym sets):

– Part meronymy;

– Member meronymy;

– Portion meronymy.
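Because hypernymy and meronymy are transitive, a synset is related not only to its direct hypernym but to every synset further up the chain. A minimal Python sketch of collecting that chain follows. The function name `hypernym_closure` and the toy data are illustrative assumptions, not taken from the dictionaries described in the slides.

```python
def hypernym_closure(synset, direct_hypernyms):
    """Return all direct and indirect hypernyms of `synset`.

    `direct_hypernyms` maps a synset to the list of its direct hypernyms.
    The result is in discovery order and contains no duplicates.
    """
    found = []
    stack = list(direct_hypernyms.get(synset, []))
    while stack:
        current = stack.pop()
        if current not in found:              # also guards against cycles
            found.append(current)
            stack.extend(direct_hypernyms.get(current, []))
    return found


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = {
        "factory": ["plant"],
        "plant": ["building complex"],
        "building complex": ["structure"],
    }
    print(hypernym_closure("factory", toy))
```

The same traversal works for meronymy if the mapping holds part-of links instead of hypernym links.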

Page 16

Selected relations

• Similar to (symmetric relation between similar adjectival synsets);

• Verb group (symmetric relation between semantically related verb synsets);

• Also see (symmetric relation between verb or adjective synsets that are close in meaning);

• Category domain (asymmetric extralinguistic relation between a synset denoting a concept and the sphere of knowledge it belongs to).

Page 17

DELAF semantic dictionaries

• These dictionaries consist of pairs of literals defined for the corresponding semantic relation:

– car,automobile.N

– auto,automobile.N

• All possible combinations between literals in the given synsets are listed:

– car,automobile.N

– cars,automobile.N

– auto,automobile.N

– autos,automobile.N
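The listing above can be generated mechanically. The Python sketch below rests on one assumption about the format: the second field of each line is a literal chosen to represent the synset, and the first field is any inflected form of another literal in that synset. The function name `delaf_pairs` and its parameters are hypothetical.

```python
def delaf_pairs(forms_by_literal, head, pos):
    """List 'form,head.POS' lines for every inflected form in a synset.

    forms_by_literal: maps each literal to its inflected forms.
    head: the literal standing for the whole synset (an assumption here).
    pos: part-of-speech code, e.g. "N".
    """
    lines = []
    for literal, forms in forms_by_literal.items():
        if literal == head:
            continue                      # the head is not paired with itself
        for form in forms:
            lines.append(f"{form},{head}.{pos}")
    return lines


if __name__ == "__main__":
    synset = {"car": ["car", "cars"], "auto": ["auto", "autos"]}
    for line in delaf_pairs(synset, "automobile", "N"):
        print(line)
```

The number of lines grows with the product of literals and inflected forms, which is the redundancy that the NooJ format on the following slides avoids by attaching a synset identifier to each lemma.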

Page 18

NooJ Semantic dictionaries

Synonymy relation

‘a plant consisting of buildings with facilities for manufacturing’

фабрика,N+FLX=ENG20-03196165-n
предприятие,N+FLX=ENG20-03196165-n

factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
manufactory,N+FLX=ENG20-03196165-n
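Since every entry carries its synset identifier in the FLX field, synonyms can be recovered by grouping entries on that identifier. The Python sketch below illustrates this outside NooJ. It is not how NooJ itself processes the dictionary, and the function names are hypothetical.

```python
from collections import defaultdict


def group_by_synset(entries):
    """Group dictionary lemmas by the synset identifier in their FLX field."""
    synsets = defaultdict(list)
    for entry in entries:
        # Split on the last comma so multi-word lemmas stay intact.
        lemma, _, info = entry.rpartition(",")
        for feature in info.split("+"):
            if feature.startswith("FLX="):
                synsets[feature[len("FLX="):]].append(lemma)
    return dict(synsets)


def synonyms(word, synsets):
    """Return the other members of every synset that contains `word`."""
    return sorted({other
                   for members in synsets.values() if word in members
                   for other in members if other != word})


if __name__ == "__main__":
    entries = [
        "factory,N+FLX=ENG20-03196165-n",
        "mill,N+FLX=ENG20-03196165-n",
        "manufacturing plant,N+FLX=ENG20-03196165-n",
        "manufactory,N+FLX=ENG20-03196165-n",
    ]
    print(synonyms("mill", group_by_synset(entries)))
```

A query for one literal can then be expanded to all members of its synset, which is the retrieval-by-semantic-equivalence use case.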

Page 19

NooJ Semantic dictionaries

Hypernymy relation

‘the organized action of making of goods and services for sale’

производство,N+FLX=ENG20-00859333-n
промишленост,N+FLX=ENG20-00859333-n
индустрия,N+FLX=ENG20-00859333-n

production,N+FLX=ENG20-00859333-n
industry,N+FLX=ENG20-00859333-n
manufacture,N+FLX=ENG20-00859333-n

Page 20

Inflecting WordNet

<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam (to remove)<SENSE>…</SENSE><LNOTEGR>ГНТ12</LNOTEGR></LITERAL>
  </SYNONYM>
  <ILR>...<TYPE>...</TYPE></ILR>
  <DEF>remove something concrete, as by lifting, pushing, taking off, etc. or remove something abstract</DEF>
  <BCS>...</BCS>
</SYNSET>
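Reading the inflection note attached to each literal is a simple XML traversal. The Python sketch below uses the standard-library `xml.etree.ElementTree` module on a cut-down sample that mirrors the structure above. The sample omits the English gloss, keeps the elided values as they appear on the slide, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down sample mirroring the slide; elided values are left elided.
SAMPLE = """<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam<SENSE>...</SENSE><LNOTEGR>ГНТ12</LNOTEGR></LITERAL>
  </SYNONYM>
</SYNSET>"""


def literals_with_inflection(xml_text):
    """Return (literal, inflection note) pairs found in one SYNSET element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for literal in root.iter("LITERAL"):
        # .text holds the characters before the first child element.
        lemma = (literal.text or "").strip()
        note = literal.findtext("LNOTEGR", default="")
        pairs.append((lemma, note))
    return pairs


if __name__ == "__main__":
    print(literals_with_inflection(SAMPLE))
```

The note (here ГНТ12) is what links a WordNet literal to its inflection type in the grammatical dictionary.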

Page 21

NooJ Semantic descriptions

‘the organized action of making of goods and services for sale’

ENG20-00859333-n = <E>/Hs0 + то/Hsd + <L1>а<S1>/Hp0 + <L1>ата<S1>/Hpd +
<L9>мишленост<S9>/Ss0 + <L9>мишлеността<S9>/Ssd + <L9>мишлености<S9>/Sp0 + <L9>мишленостите<S9>/Spd +
<B12>индустрия/Ss0 + <B12>индустрията/Ssd + <B12>индустрии/Sp0 + <B12>индустриите/Spd;

ENG20-00859333-n = <E>/Hs + <B10>industry/Ss + <B10>industries/Sp0 +
<B10>manufacture/Ss + <B10>manufactures/Sp;

Page 22

After the nice solutions

• Lemmas which are not included in the BGD:

– Classification of lemmas into existing inflection types;

– Formal description of new inflection types;

– Literals in Latin;

– Validating WordNet.

• Semantic ambiguity – literals with two inflectional descriptions in BGD;

• Compound words:

– Formal description of inflection types;

– Classification of compounds.

Page 23

NooJ Compound semantic descriptions

ENG20-04182583-n = <E>/Ss0 + <P>та/Ssd + <B>и<P><B>(и/p0 +ите/pd) +
<B7>завод<P><B2>ен/Ss0 + <B7>завод<P><B2>ния/Ssh + <B7>завод<P><B2>ният/Ssl +
<B7>заводи<P><B2>ни/Sа0 + <B7>заводи<P><B2>ните/Sа0 +
<B7>рафинерия/Ss0 + <B7>рафинерия<P>та/Ssd + <B7>рафинерии<P><B>и/Sp0 + <B7>рафинерии<P><B>ите/Spd;

Page 27

Applications of the Semantic Dictionaries

• Information retrieval by means of semantic equivalence with synonymy dictionaries;

• Information retrieval by means of semantic specification with hyperonymy and meronymy dictionaries;

• Information retrieval by means of similarity;

• Information retrieval by means of thematic domain affiliations;

• Validation of the WordNet structure for completeness and consistency.

Page 28

Future directions

• Extensions and enhancements of the semantic dictionaries by means of:

– Extension of the dictionaries' coverage;

– Addition of other semantic relations;

– Inclusion of additional information in the entries.

• Integration of multilingual semantic extraction with NooJ using the Inter-Lingual-Index relation.