19 Sept. 2017

eLex 2017 | BDD: WordNet and BabelNet

Bilingual Dictionary Drafting: Bootstrapping WordNet and BabelNet

David Lindemann
UPV/EHU University of the Basque Country
[email protected]

Fritz Kliche
University of Hildesheim
fritz.kliche@uni-hildesheim.de

Overview

Introduction
  Motivation
  Overview of Bilingual Dictionary Drafting (= BDD) methods
  Some previous research

BDD using concept-oriented lexical resources: The example of Basque-English
  Concept-oriented vs. headword-oriented resources
  Data extraction from WordNet / BabelNet: Workflow

Basque-English dictionary draft: Evaluation
  Standard Basque dictionary headwords
  Quantitative Evaluation: BabelNet English-Basque intersection
  Qualitative Evaluation: Assessment of translation equivalents

Post-processing / editing issues

Conclusions and Further Work

Motivation

400+ languages with 1 million L1 speakers or more

Availability of bilingual dictionaries: many scarcely resourced language pairs
  Even where one of the top ten languages is involved
  Example Basque: only ES, FR, EN, RU, (DE) are covered

Possible ad hoc workarounds for scarcely resourced language pairs:
  (1) Use two bilingual dictionaries
  (2) Use an automatically built dictionary or MT (more and more of them available)
  Disadvantages
    Time-consuming
    Misleading lookups (main problem: polysemy / asymmetric lexicalization)

Lexicography for uncovered language pairs (= from scratch)
  Automated drafting of translation equivalent pairs
    Saves human resources
    Reciprocal bootstrapping: upgrading of the resources employed for BDD

BDD Methods: A brief overview

Corpus-based
  Word alignment in parallel corpora
    Bilingual parallel corpus: bilingual word lists (without Word Sense Disambiguation, WSD)
      Gale & Church 1991, Héja 2010, among others
    Multilingual parallel corpus: information for WSD using asymmetries in lexicalization across languages
      cf. Lefever 2012, 2014, among others; see Kazakov & Shahid 2013 for a survey

Dictionary pivoting
  Connecting lemma-based lexical resources to each other
  Filtering of polysemy-related errors with corpus-based methods
    cf. Saralegi et al. 2012, among others

Bootstrapping concept-oriented resources
  Wikipedia Interlanguage Links (cf. Navigli & Ponzetto 2010)
  Open Multilingual WordNet (Bond & Foster 2013, cf. Varga et al. 2009)
  ConceptNet (Speer & Havasi 2012)
  BabelNet (Navigli & Ponzetto 2010)
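
To make the polysemy problem of dictionary pivoting concrete, here is a minimal sketch that chains two lemma-based word lists through English. The word lists are toy data invented for illustration, and the function name `pivot` is hypothetical; this is not the method of any of the cited papers.

```python
from collections import defaultdict

# Toy word lists, invented for illustration only.
# Each maps a source lemma to its equivalents in the next language.
basque_english = {
    "udaberri": ["spring"],           # the season
    "iturri": ["spring", "source"],   # a water source
}
english_german = {
    "spring": ["Frühling", "Quelle", "Feder"],  # season, water source, coil
    "source": ["Quelle"],
}

def pivot(src_to_pivot, pivot_to_tgt):
    """Chain two lemma-based dictionaries through the pivot language."""
    candidates = defaultdict(set)
    for src, pivots in src_to_pivot.items():
        for p in pivots:
            candidates[src].update(pivot_to_tgt.get(p, []))
    return {src: sorted(tgts) for src, tgts in candidates.items()}

draft = pivot(basque_english, english_german)
```

Because the pivot lemma 'spring' is polysemous, both Basque headwords receive all three German candidates, including wrong ones. This is the kind of error that has to be filtered afterwards, or avoided by linking at concept level.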

Own previous research on BDD

Lindemann et al. 2014 (Euralex Bolzano)
  Set of (semi-)automatic methods for German-Basque bilingual word list building
    Without Word Sense Disambiguation
  Showcase German-Basque, a scarcely resourced language pair
    Data for 2/3 of a German frequency lemma list of 40,000 items, half of it accurate

Lindemann & San Vicente 2016 (Euralex Tbilisi)
  Proposal of a lexicographic workflow for bilingual dictionaries with Basque
  BDD including discrimination of homographous lemmata and word senses
    Drafting of lemma list and lemma-POS entities by bootstrapping Basque NLP resources
    Linking to translation equivalents at word sense level via Princeton WordNet
    Automatic and manual gap detection
    Manually edited lexical data eventually sent back to Basque WordNet and other data providers

Lindemann & Kliche 2017 (eLex Leiden: this paper)
  Quantitative and qualitative evaluation of Basque-English BDD
    Basque WordNet EusWN 3.0, English Princeton WordNet 3.0
    BabelNet 3.7

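Lemma-oriented vs. Concept-oriented

[Figure: the German lemma Pferd in both views. Lemma-oriented: one headword, Pferd, with polysemy (3 word senses). Concept-oriented: one concept with synonymy (Pferd, Gaul, Ross) and translation equivalents (Caballo, Horse, Zaldi).]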

Workflow: A quick walkthrough

WordNet
  Download WordNets in table (CSV) format:
    Interlingual Index (synset IDs)
    Lexicalisations in the 2 languages
  Build single XML document

BabelNet
  Download complete dump file
  Retrieve using BabelNet Java API:
    Synset IDs
    Synset type (concept / NE), English glosses
    Lexicalisations in the 2 languages, sources
  Build single XML document

Intersection calculations ("quantitative evaluation")
  Graphical normalization of lemma strings
    Initial case, spaces, hyphens

Assessment of adequacy ("qualitative evaluation")
  For the evaluators, build a user-friendly view of the XML document
    Show glosses and lexicalisations
    Show drop-down menu for choosing assessment value
  Done using features of TshwaneLex
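
The WordNet branch of the workflow can be sketched as a join on synset IDs followed by XML serialisation. The two-column tab-separated layout, the synset IDs, and the element names `draft`, `synset` and `lemma` are assumptions made for this sketch; the real WordNet tables and the XML schema used for the draft may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tables (synset ID, lemma) with made-up IDs.
EUS_TABLE = "00000001-n\tzaldi\n"
ENG_TABLE = "00000001-n\thorse\n00000001-n\tEquus caballus\n00000002-n\tdonkey\n"

def read_table(text):
    """Group the lexicalisations of one language by synset ID."""
    by_synset = defaultdict(list)
    for synset_id, lemma in csv.reader(io.StringIO(text), delimiter="\t"):
        by_synset[synset_id].append(lemma)
    return by_synset

def build_draft(eus, eng):
    """Keep synsets lexicalised in both languages and emit one XML tree."""
    root = ET.Element("draft")
    for synset_id in sorted(eus.keys() & eng.keys()):
        synset = ET.SubElement(root, "synset", id=synset_id)
        for lang, table in (("eus", eus), ("eng", eng)):
            for lemma in table[synset_id]:
                ET.SubElement(synset, "lemma", lang=lang).text = lemma
    return root

root = build_draft(read_table(EUS_TABLE), read_table(ENG_TABLE))
xml_text = ET.tostring(root, encoding="unicode")
```

Only the synset present in both tables ends up in the draft; the English-only synset is dropped, which is what limits recall on the Basque side.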

BabelNet 3.7 English-Basque intersection

2.4 million synsets

Named Entities: 24.3%
  Place names, proper names
  May be translated: Den Haag, The Hague, Haga

Untranslated "BabelNet" Concepts: 71.1%
  Presumed 'internationalisms'
    pasta, samba, brahman, yoga, ...
  Biology, medicine terms (Greek-Latin)
  IT terms (English)
  Abbreviations: m, cm, kg

Translated Concepts: 114,000 (4.6%)
  95%+ of what we are looking for belongs to this group
  Sources for Basque concept translations found in BabelNet:
    Open Multilingual WordNet, Wikidata, Wikipedia Page Titles, Wikipedia Redirections, OmegaWiki, Wiktionary, Microsoft Terminology, GeoNames, WikiQuotes, WikiQuotes Redirections

Basque lemmata we want to find equivalents for

Corpus-based frequency headword list for Basque, "EusLemStd": 58,000 headwords (lemma-signs) that appear both in...
  ...one of the two very large Basque corpora (20+ occurrences)
    ETC, hand-selected Basque reference prose corpus (200M tokens; Sarasola, Salaburu & Landa 2013)
    Elh200, Basque web corpus (200M tokens; Leturia 2014)
  ...one of 6 major lexical resources for Basque (4 dictionaries, 2 NLP resources)

No named entities (proper names, place names)

► Lindemann & San Vicente (2015)

Bilingual Dictionary Draft: Quantitative evaluation

Headwords: intersecting sets

EusLemStd Basque lemma list      57,919  (100.0%)
EusLemStd ∩ EusWN                18,122   (31.3%)
EusLemStd ∩ EusWN ∩ BabelNet     18,004   (31.0%)
EusLemStd ∩ BabelNet             23,194   (40.0%)

Concepts: intersecting sets

                        Noun      Verb      Adjective   Adverb    All
                        synsets   synsets   synsets     synsets   synsets
EusWN ∩ EusLemStd       21,533    2,894     106         0         24,533
BabelNet ∩ EusLemStd    31,028    2,914     293         25        34,260
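
The headword intersections depend on the graphical normalization mentioned in the workflow (initial case, spaces, hyphens). A minimal sketch follows, assuming that normalization means lowercasing the first letter and dropping spaces and hyphens; dropping underscores as well is an extra assumption, and the function names and toy lemma lists are hypothetical.

```python
def normalize(lemma):
    """Lowercase the initial letter and drop spaces, underscores and hyphens."""
    lemma = lemma.strip()
    lemma = lemma[:1].lower() + lemma[1:]
    for ch in (" ", "_", "-"):
        lemma = lemma.replace(ch, "")
    return lemma

def coverage(headwords, resource_lemmas):
    """Return (number of headwords found, share of the headword list)."""
    wanted = {normalize(h) for h in headwords}
    found = wanted & {normalize(l) for l in resource_lemmas}
    return len(found), len(found) / len(wanted)

# Toy lists, invented for illustration.
hits, share = coverage(
    ["zaldi", "itsaso", "buruhauste"],
    ["Zaldi", "buru-hauste", "etxe"],
)

# The reported shares follow from the counts in the headword table.
euswn_share = round(100 * 18122 / 57919, 1)
babelnet_share = round(100 * 23194 / 57919, 1)
```

Without normalization, 'Zaldi' and 'buru-hauste' would count as misses, so the intersection sizes are sensitive to how aggressively strings are normalized.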

Qualitative evaluation: Manual assessment

Translation equivalents:
  OK: correct mapping
  FUZZY: not false, but not suitable as a translation equivalent in a dictionary without editing
  FALSE: incorrect mapping
  (cf. Fišer, Gantar & Krek 2012, Lindemann et al. 2014)
  MERGE ERROR: in BabelNet, incorrect merging of concepts

[Screenshot: Manual assessments in TshwaneLex]

Examples:

OK:
  Advanced in years 'aged, elderly, older, senior' – adindun, adineko, edadetu

FUZZY:
  First in order of birth 'firstborn, eldest' – zahar
  [the 'autohyponymy' problem, cf. Pociello et al. 2011]

FALSE:
  Provide with a gift 'treat' – hartu, hitz egin, tratatu
  [mismatch to most common sense of 'treat']

MERGE ERROR:
  'Tube, metro, underground' (The London Underground)
  'Resistance, underground' (A secret group organized to overthrow the government)

Qualitative Evaluation: Results for WordNet

EusWN/PWN equivalences             Nouns      Verbs      Adjectives   All POS
Total synsets EusWN ∩ EusLemStd    21,533     2,894      106          24,533
  Monosemous                       6,058      201        11           6,270
  Polysemous                       15,343     2,693      95           18,131
Synsets evaluated                  100        100        100          300
  Monosemous                       50         50         16
  Polysemous                       50         50         84
Synsets all items OK               87%        75%        94 (94%)     85%
  Monosemous                       45 (90%)   37 (74%)
  Polysemous                       42 (84%)   38 (76%)
Synsets OK/FUZZY                   98%        94%        96 (96%)     96%
  Monosemous                       49 (98%)   48 (96%)
  Polysemous                       49 (98%)   46 (92%)
Synsets 1+ FALSE                   2%         7%         4 (4%)       4%
  Monosemous                       1 (2%)     2 (4%)
  Polysemous                       1 (2%)     5 (10%)

Qualitative Evaluation: Results for BabelNet

BabelNet 3.7                          Nouns         Verbs         Adject.       Adverbs      Total
Assessed synsets                      200           200           200           25           625
All items OK                          179 (89.5%)   163 (81.5%)   188 (94.0%)   23 (92.0%)   553 (88.5%)
1+ items OK, and 1+ items FUZZY       3 (1.5%)      14 (7.0%)     2 (1.0%)      0 (0.0%)     19 (3.0%)
1+ items OK, and 1+ items FALSE       2 (1.0%)      3 (1.5%)      0 (0.0%)      0 (0.0%)     5 (0.8%)
All items FUZZY                       5 (2.5%)      9 (4.5%)      8 (4.0%)      0 (0.0%)     22 (3.5%)
1+ items FUZZY, and 1+ items FALSE    1 (0.5%)      0 (0.0%)      0 (0.0%)      0 (0.0%)     1 (0.2%)
All items FALSE                       5 (2.5%)      8 (4.0%)      1 (0.5%)      2 (8.0%)     16 (2.6%)
MERGE_ERROR                           5 (2.5%)      3 (1.5%)      1 (0.5%)      0 (0.0%)     9 (1.4%)
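
The row labels of the BabelNet results table combine item-level assessments into one synset-level category. A minimal sketch of that aggregation follows. Treating MERGE_ERROR as a synset-level flag that overrides the item labels, and returning 'UNCLASSIFIED' for combinations the table does not list, are assumptions of this sketch; the names `CATEGORIES`, `synset_category` and `tally` are hypothetical.

```python
from collections import Counter

# Synset-level rows of the results table, keyed by the set of
# item-level labels found within one synset.
CATEGORIES = {
    frozenset({"OK"}): "All items OK",
    frozenset({"OK", "FUZZY"}): "1+ items OK, and 1+ items FUZZY",
    frozenset({"OK", "FALSE"}): "1+ items OK, and 1+ items FALSE",
    frozenset({"FUZZY"}): "All items FUZZY",
    frozenset({"FUZZY", "FALSE"}): "1+ items FUZZY, and 1+ items FALSE",
    frozenset({"FALSE"}): "All items FALSE",
}

def synset_category(item_labels, merge_error=False):
    """Map the labels of one synset's items to a table row."""
    if merge_error:  # assumption: a merge error overrides item labels
        return "MERGE_ERROR"
    return CATEGORIES.get(frozenset(item_labels), "UNCLASSIFIED")

def tally(assessed):
    """Count synsets per category; `assessed` holds (labels, merge_error) pairs."""
    return Counter(synset_category(labels, merged) for labels, merged in assessed)

# Toy assessments, invented for illustration.
sample = [
    (["OK", "OK", "OK"], False),
    (["OK", "FUZZY"], False),
    (["FALSE"], False),
    (["OK"], True),
]
result = tally(sample)
```

Dividing each count by the number of assessed synsets per part of speech gives the percentages in the table.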

Qualitative Evaluation: BabelNet sources

BabelNet 3.7                 OK              FUZZY        FALSE        MERGE ERROR   Assessments
All Sources                  1,211 (88.9%)   63 (4.6%)    44 (3.2%)    44 (3.2%)     1,362
Open Multilingual WordNet    717 (89.2%)     49 (6.1%)    28 (3.5%)    10 (1.2%)     804
Wikidata                     57 (93.4%)      0 (0.0%)     1 (1.6%)     3 (4.9%)      61
Wikipedia                    194 (87.8%)     5 (2.3%)     6 (2.7%)     16 (7.2%)     221
BabelNet                     3 (100.0%)      0 (0.0%)     0 (0.0%)     0 (0.0%)      3
Wikipedia Redirections       13 (52.0%)      3 (12.0%)    4 (16.0%)    5 (20.0%)     25
OmegaWiki                    75 (91.5%)      2 (2.4%)     0 (0.0%)     5 (6.1%)      82
Wiktionary                   132 (92.3%)     4 (2.8%)     5 (3.5%)     2 (1.4%)      143
Microsoft Terminology        20 (87.0%)      0 (0.0%)     0 (0.0%)     3 (13.0%)     23
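
As a cross-check of the per-source figures, the short sketch below recomputes the OK share of each source from the counts in the table and ranks the sources by it. The variable and function names are hypothetical; the counts are copied from the table.

```python
# (OK, FUZZY, FALSE, MERGE ERROR) counts per source, copied from the table.
counts = {
    "Open Multilingual WordNet": (717, 49, 28, 10),
    "Wikidata": (57, 0, 1, 3),
    "Wikipedia": (194, 5, 6, 16),
    "BabelNet": (3, 0, 0, 0),
    "Wikipedia Redirections": (13, 3, 4, 5),
    "OmegaWiki": (75, 2, 0, 5),
    "Wiktionary": (132, 4, 5, 2),
    "Microsoft Terminology": (20, 0, 0, 3),
}

def ok_share(row):
    """Percentage of assessments in a row that were rated OK."""
    return 100 * row[0] / sum(row)

# Sources ordered from highest to lowest OK share.
ranked = sorted(counts, key=lambda s: ok_share(counts[s]), reverse=True)

# Column totals should reproduce the 'All Sources' row.
totals = [sum(col) for col in zip(*counts.values())]
```

The ranking should be read with the sample sizes in mind: the source labelled BabelNet has only 3 assessments and Microsoft Terminology only 23, so their shares are far less stable than the 804 assessments for Open Multilingual WordNet.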

Post-processing / editing: Some central issues

Creation of a headword-oriented dictionary
  Transformation of XML containing dictionary draft

Homonym disambiguation
  In WordNet, Wikipedia, BabelNet, homonymy = polysemy
  In a dictionary, homonymy ≠ polysemy

Representation of polysemy
  Does the draft entry contain all word senses?
  Is the splitting of senses...
    ...too fine-grained?
    ...even redundant?
    ...too coarse-grained?

Other issues: cf. Benjamin 2016
  Restrictive licensing of some WordNets

[Figure: Homonyms in Cambridge Learner's Dictionary with CD-ROM, 2007]
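
The step from a concept-oriented draft to a headword-oriented dictionary amounts to inverting the synset records. A minimal sketch with a hypothetical record layout follows: the first two records reuse the OK and FUZZY examples from the assessment slide, the third is invented, and the function name `to_headword_entries` is not taken from the paper.

```python
from collections import defaultdict

# Concept-oriented draft: one record per synset (layout is hypothetical).
synsets = [
    {"gloss": "advanced in years",
     "eus": ["adindun", "adineko", "edadetu"],
     "eng": ["aged", "elderly", "older", "senior"]},
    {"gloss": "first in order of birth",
     "eus": ["zahar"],
     "eng": ["firstborn", "eldest"]},
    {"gloss": "having lived for a long time",  # invented record
     "eus": ["zahar"],
     "eng": ["old"]},
]

def to_headword_entries(synsets, src="eus", tgt="eng"):
    """Invert synsets into entries: headword -> list of senses."""
    entries = defaultdict(list)
    for synset in synsets:
        for headword in synset[src]:
            entries[headword].append(
                {"gloss": synset["gloss"], "equivalents": list(synset[tgt])}
            )
    return dict(entries)

entries = to_headword_entries(synsets)
```

Each synset a headword belongs to becomes one sense of its entry. The inversion cannot tell homonymy from polysemy, and it inherits the sense granularity of the source, so the editing questions listed on the slide remain manual decisions.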

WN/BN bootstrapping for EUS-EN: Result Overview

Recall on initial Basque headword list
  EusWN / PWN alone: 30%
  BabelNet: 40%

Precision
  EusWN / PWN alone: 90%
  BabelNet: 90%

BabelNet
  Higher recall than WN alone
  Similar precision

Does this approach work with language pairs 'un-resourced' in Bilingual Lexicography?
  YES, it does

Application example: WordNet/BabelNet bootstrapping for EUS-SLO

Basque (EUS) – Slovene (SLO): A totally uncovered pair of 'smaller' languages

Quantitative Evaluation
  Recall: synsets that contain 1+ Basque standard headword and 1+ Slovene item
    EusWN / SloWNet: 20% (66% of 30%)
    BabelNet: 31% (78% of 40%)
  Recall on 5,000 most frequent Basque headwords (BabelNet): 74% (3,707)
  Recall on 20,000 most frequent Basque headwords (BabelNet): 53% (10,549)

Qualitative Evaluation
  Precision: unknown. EN-SL precision to be measured first.
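
The recall figures can be reproduced with a few lines of arithmetic. The sketch assumes that 'X% of Y%' means the share X of the headwords already covered for Basque-English (recall Y) that also have a Slovene item in the same synset; the function name `chained_recall` is hypothetical.

```python
def chained_recall(share_with_slovene, basque_english_recall):
    """Recall for EUS-SLO as a share of the EUS-EN recall."""
    return share_with_slovene * basque_english_recall

wordnet_recall = round(100 * chained_recall(0.66, 0.30))   # EusWN / SloWNet
babelnet_recall = round(100 * chained_recall(0.78, 0.40))  # BabelNet

# Recall on the most frequent headwords, from the reported counts.
top_5k = round(100 * 3707 / 5000)
top_20k = round(100 * 10549 / 20000)
```

The drop from 74% to 53% between the 5,000 and 20,000 most frequent headwords shows that coverage is concentrated in the high-frequency vocabulary.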

Conclusions and further work

Bilingual Dictionary Draft for Basque-English including sense-to-sense mappings
  Encouraging recall and precision rates; can be applied to other language pairs

Preliminaries for a research project
  Bilingual Dictionary Drafts for many uncovered language pairs
  Data model that allows
    Manual and semi-automated (bulk) editing
    Edition of e-dictionaries including more item types
    Retro-updating of original resources: 'Bootstrapping Loop'
  Engagement of lexicographers for editing 'their' language pair
  Edition of a new series of bilingual dictionaries with Basque

[Figure: 'Bootstrapping Loop'. Image source: Wikimedia Commons]

Thank you for your attention
Eskerrik asko, bedankt!

[email protected]
fritz.kliche@uni-hildesheim.de

The research leading to these results has received funding from the Basque Government (Research Group IT665-13). Funding is gratefully acknowledged.

References

Benjamin, M. (2016). Problems and Procedures to Make Wordnet Data (Retro)Fit for a Multilingual Dictionary. In Proceedings of the Eighth Global WordNet Conference (pp. 27–33). Bucharest: Alexandru Ioan Cuza University of Iasi.

Bond, F., & Foster, R. (2013). Linking and Extending an Open Multilingual Wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (pp. 1352–1362).

Fišer, D., Gantar, P., & Krek, S. (2012). Using explicitly and implicitly encoded semantic relations to map Slovene Wordnet and Slovene Lexical Database. In Semantic Relations-II. Enhancing Resources and Applications Workshop Programme (p. 77).

Gale, W. A., & Church, K. W. (1991). Identifying Word Correspondences in Parallel Texts. In Proceedings of the ACL Workshop on Speech and Natural Language (pp. 152–157). Stroudsburg, PA: Association for Computational Linguistics.

Héja, E. (2010). The Role of Parallel Corpora in Bilingual Lexicography. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, … D. Tapias (Eds.), Proceedings of LREC 2010. Valletta.

Kazakov, D., & Shahid, A. R. (2013). Using Parallel Corpora for Word Sense Disambiguation. In Proceedings of RANLP 2013 (pp. 336–341). Hissar.

Lefever, E. (2012). ParaSense: parallel corpora for word sense disambiguation (PhD Thesis). Universiteit Gent, Gent.

Lindemann, D., & San Vicente, I. (2015). Building Corpus-based Frequency Lemma Lists. Procedia - Social and Behavioral Sciences, 198, 266–277.

Lindemann, D., & San Vicente, I. (2016). Bilingual Dictionary Drafting: Connecting Basque word senses to multilingual equivalents. In Proceedings of EURALEX 2016 (pp. 898–905). Tbilisi.

Lindemann, D., Saralegi, X., San Vicente, I., Manterola, I., & Nazar, R. (2014). Bilingual Dictionary Drafting. The example of German-Basque, a medium-density language pair. In Proceedings of EURALEX 2014 (pp. 563–576). Bolzano.

Navigli, R., & Ponzetto, S. P. (2010). BabelNet: Building a Very Large Multilingual Semantic Network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 216–225). Stroudsburg.

Pociello, E., Agirre, E., & Aldezabal, I. (2011). Methodology and construction of the Basque WordNet. Language Resources and Evaluation, 45(2), 121–142.

Saralegi, X., Manterola, I., & San Vicente, I. (2012). Building a Basque-Chinese Dictionary by Using English as Pivot. In Proceedings of LREC 2012. Istanbul.

Speer, R., & Havasi, C. (2012). Representing General Relational Knowledge in ConceptNet 5. In Proceedings of LREC 2012. Istanbul.

Varga, I., Yokoyama, S., & Hashimoto, C. (2009). Dictionary generation for less-frequent language pairs using WordNet. Literary and Linguistic Computing, 24(4), 449–466.