Ewa Rudnicka, Wojciech Witkowski, Maciej Piasecki
G4.19 Research Group
Institute of Informatics, Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl

Large Polish-English Lexico-Semantic Resource Based on plWordNet - Princeton WordNet Mapping

Outline

• What is a wordnet?

• Mapping plWordNet on Princeton WordNet

• Extending Princeton WordNet

• Applications

• Conclusions

What is a wordnet? (1)

A huge electronic lexico-semantic database (a kind of thesaurus)

Basic building blocks:

- lemma – base form representing different inflectional forms and different meanings
  e.g. czwórka – 'good' (the school grade)

- lexical unit – a lemma plus sense pair (in wordnets marked with a number)
  e.g. czwórka 3 (por – 'communication')

- synset – a set of synonymous lexical units
  e.g. {czwórka 3 (por), czwóra 1 (por)}

What is a wordnet? (2)

Both lexical units and synsets linked via different lexico-semantic relations such as:
synonymy, near-synonymy, hypernymy/hyponymy, meronymy/holonymy, fuzzynymy

Examples:

Lexical relations:
- czwórka 3 (por) has a derivativity relation to czwórka 4 (por)
- czwórka 3 (por) has an expressiveness relation to czwóra 1 (por)

Synset relations:
- {czwórka 3 (por), czwóra 1 (por)} is a hyponym of {stopień 3 (il), ocena 1 (il), nota 3 (il)}
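The building blocks and relations above can be pictured as a small data structure. This is only a minimal sketch: the class names `LexicalUnit` and `Synset`, the `add_relation` helper, and the relation labels used as dictionary keys are illustrative assumptions, not the storage format of plWordNet or Princeton WordNet.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma paired with a sense number, e.g. czwórka 3 (hypothetical class)."""
    lemma: str
    sense: int
    domain: str = ""  # domain label as printed on the slides, e.g. "por"


@dataclass
class Synset:
    """A set of synonymous lexical units plus its outgoing synset relations."""
    units: frozenset
    relations: dict = field(default_factory=dict)  # relation name -> list of Synsets

    def add_relation(self, name, target):
        self.relations.setdefault(name, []).append(target)


czworka_3 = LexicalUnit("czwórka", 3, "por")
czwora_1 = LexicalUnit("czwóra", 1, "por")

grade = Synset(frozenset({
    LexicalUnit("stopień", 3, "il"),
    LexicalUnit("ocena", 1, "il"),
    LexicalUnit("nota", 3, "il"),
}))
four = Synset(frozenset({czworka_3, czwora_1}))

# Synset relation from the slide: {czwórka 3, czwóra 1} is a hyponym of
# {stopień 3, ocena 1, nota 3}.
four.add_relation("hyponymy", grade)

# Lexical relations hold between individual lexical units, not between synsets.
lexical_relations = [
    (czworka_3, "derivativity", LexicalUnit("czwórka", 4, "por")),
    (czworka_3, "expressiveness", czwora_1),
]
```

Keeping lexical relations and synset relations in separate containers mirrors the distinction drawn on the slide.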

Princeton WordNet

Princeton WordNet (Fellbaum 1998):

• the first wordnet ever built on psycholinguistic principles – mapping the structure of human lexical memory (cf. Miller 1998)
• taxonomic hierarchies for nouns, entailment relations for verbs, antonym relations for adjectives
• synsets represent 'lexicalised concepts' (cf. Miller 1998)
• synsets built of lexical units linked by synonymy relation, understood as a conceptual relation established on the basis of linguist's intuitions and dictionary definitions
• No major changes since 2006, last version 2012

plWordNet - Słowosieć

• plWordNet (plWN)
  • developed fairly independently of Princeton WordNet (PWN) by applying a unique corpus-based method
  • one of the biggest existing wordnets

Number of        plWN      PWN       enWN
lemmas           156,402   155,593   157,541
lexical units    220,129   206,978   209,147
synsets          162,629   117,659   119,290

plWordNet vs. Princeton WordNet

• the emphasis on relations between lexical units, with new relations specially designed to cover the peculiarities of the morphosyntactic structure of Polish
  • cf. Piasecki et al. 2009, Maziarz et al. 2012
• synsets built of lexical units sharing the same set of constitutive relations
  • such as hyponymy, hypernymy, meronymy, holonymy
• partly linked to Princeton WordNet
  • cf. Rudnicka et al. 2012

Mapping plWordNet on Princeton WordNet

• Goal: Linking plWordNet synsets with Princeton WordNet synsets
• Steps:
  • Defining a set of inter-lingual relations and setting their hierarchy
  • Designing mapping procedures for nouns and adjectives
• Mapping direction: plWordNet > Princeton WordNet
• Bottom-up approach – starting from the lowest levels in the hierarchy
• Currently mapped lexical categories:
  • nouns (most of them), adjectives (about a half)

Automatic prompts

Two systems, based on:
1) relaxation labeling algorithm (nouns)
2) rules relying on the network of the existing intra- and inter-lingual relations (adjectives)

Resource: cascade dictionary

Generated prompts:
- visible in the form of special links in WordNetLoom editing system
- verified by lexicographers

A set of inter-lingual relations and current statistics

• A set of inter-lingual relations between plWN and PWN
  • inspired by:
    • inter-lingual relations from EuroWordNet (Vossen 2002)
    • intra-lingual relations from plWordNet (Maziarz et al. 2011)
• Statistics of the established inter-lingual links:

  1. Synonymy: nouns 28 736, adjectives 3 199
  2. Partial synonymy: nouns 2 580, adjectives 1 003
  3. Inter-register synonymy: nouns 1 510, adjectives 35
  4. Hyponymy: nouns 57 029, adjectives 6 561
  5. Hypernymy: nouns 3 744, adjectives 34
  6. Meronymy: 6 034
  7. Holonymy: 1 204
  8. Cross-categorial synonymy: 3 891

Motivation for the extension of Princeton WordNet

• The high percentage of inter-lingual hyponymy links between plWordNet and Princeton WordNet synsets
  • established due to a number of lexical coverage gaps in Princeton WordNet
  • and the resulting impossibility of establishing much more informative and useful inter-lingual synonymy links
• These links can be used as ‘pointers’ to specific Princeton WordNet gaps (‘missing’ lexical units) and to whole ‘empty nests’ (several missing co-hyponyms of one hypernym synset) in the network
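The 'high percentage' can be checked against the statistics of established inter-lingual links reported earlier in the talk. The sketch below is only a rough calculation under one assumption: it uses the five relations for which the statistics give a figure for both nouns and adjectives, and leaves out meronymy, holonymy and cross-categorial synonymy. The function name `share` is illustrative.

```python
# Link counts copied from the statistics slide. Only relations with a figure
# for both nouns and adjectives are included (an assumption of this sketch).
links = {
    "synonymy": (28736, 3199),
    "partial synonymy": (2580, 1003),
    "inter-register synonymy": (1510, 35),
    "hyponymy": (57029, 6561),
    "hypernymy": (3744, 34),
}


def share(relation, column):
    """Fraction of the links in one column (0 = nouns, 1 = adjectives)."""
    total = sum(counts[column] for counts in links.values())
    return links[relation][column] / total


noun_hyponymy_share = share("hyponymy", 0)
adjective_hyponymy_share = share("hyponymy", 1)
print(f"nouns: {noun_hyponymy_share:.1%}, adjectives: {adjective_hyponymy_share:.1%}")
```

Under this assumption, hyponymy accounts for roughly three fifths of the listed links in both columns, about twice the share of plain synonymy.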

Inter-lingual hyponymy links

General extension procedure

• The starting point: existing inter-lingual hyponymy links
• Lemmas of plWordNet synsets are translated by a cascade dictionary
  • which combines several traditional dictionaries, with the data ordered in a hierarchy of importance; the topmost gets the highest priority
• The results are filtered by the lemmas of Princeton WordNet, to obtain:
  • a list of plWN lemmas with ‘equivalent’ cascade dictionary lemmas absent from PWN
  • a list of plWN lemmas without ‘equivalent’ cascade dictionary lemmas
  • a list of plWN lemmas with ‘equivalent’ cascade dictionary lemmas present in PWN
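The translate-then-filter step can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the merging of translations in priority order, and the rule that a lemma counts as 'present in PWN' when at least one of its translations is found there are all assumptions. The dictionaries and the PWN lemma set are toy data and do not reflect the real resources.

```python
def cascade_translate(lemma, dictionaries):
    """Collect translations of `lemma`, highest-priority dictionary first."""
    seen, ordered = set(), []
    for dictionary in dictionaries:  # ordered from most to least important
        for translation in dictionary.get(lemma, []):
            if translation not in seen:
                seen.add(translation)
                ordered.append(translation)
    return ordered


def filter_by_pwn(plwn_lemmas, dictionaries, pwn_lemmas):
    """Split plWN lemmas into the three lists named on the slide."""
    absent_from_pwn, untranslated, present_in_pwn = [], [], []
    for lemma in plwn_lemmas:
        translations = cascade_translate(lemma, dictionaries)
        if not translations:
            untranslated.append(lemma)
        elif any(t in pwn_lemmas for t in translations):
            # Assumption: one translation found in PWN is enough.
            present_in_pwn.append(lemma)
        else:
            absent_from_pwn.append(lemma)
    return absent_from_pwn, untranslated, present_in_pwn


# Toy data only: two hypothetical dictionaries and a hypothetical PWN lemma set.
dictionaries = [
    {"kot": ["cat"]},
    {"kot": ["cat", "tomcat"], "oscypek": ["oscypek"]},
]
toy_pwn_lemmas = {"cat", "tomcat"}

absent, untranslated, present = filter_by_pwn(
    ["kot", "oscypek", "żurek"], dictionaries, toy_pwn_lemmas
)
```

In this toy run the first list (translations exist but are absent from PWN) is the one that points at candidate gaps.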

Extension procedure

• Start at the lowest level of the hierarchy
  • in order not to change the structure of the original Princeton WordNet
• Verification of the suggested English equivalent(s)
  • in corpora and other reliable sources
  • on the basis of
    • the researcher’s knowledge
    • dictionaries
    • frequency lists from corpora
• Creation of the new Princeton WordNet synset
• The synset is linked
  • via intra-lingual hyponymy relation to a proper PWN hypernym synset
  • via inter-lingual synonymy relation to its direct counterpart in plWordNet
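The last two steps (creating the synset and attaching its two links) might look like the sketch below. Everything here is hypothetical: the `Synset` class, the `add_pwn_synset` helper, the relation labels and the synset identifiers are made up for illustration and are not taken from WordNetLoom or from either wordnet.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    ident: str
    lemmas: list
    links: list = field(default_factory=list)  # (relation, target identifier)


def add_pwn_synset(pwn, new_id, lemmas, hypernym_id, plwn_synset):
    """Create a new leaf synset in `pwn` and attach the two links from the slide."""
    if hypernym_id not in pwn:
        raise KeyError(f"unknown hypernym synset: {hypernym_id}")
    new_synset = Synset(new_id, list(lemmas))
    # Intra-lingual link: the new synset becomes a hyponym of an existing PWN synset.
    new_synset.links.append(("hyponymy", hypernym_id))
    # Inter-lingual link: synonymy with its direct counterpart in plWordNet.
    new_synset.links.append(("inter-lingual synonymy", plwn_synset.ident))
    pwn[new_id] = new_synset
    return new_synset


# Toy data with made-up identifiers.
pwn = {"pwn:dish": Synset("pwn:dish", ["dish"])}
plwn_bigos = Synset("plwn:bigos", ["bigos"])
added = add_pwn_synset(pwn, "pwn:bigos", ["bigos"], "pwn:dish", plwn_bigos)
```

Because the new synset is only ever attached below an existing hypernym, none of the existing links in `pwn` are modified, which matches the aim of leaving the original structure unchanged.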

Extension results

• Each added synset provided with:
  • a definition
    • major source: English Wikipedia
  • a usage example
    • from a corpus or other reliable English source
• Total number of selected plWN synsets: 42785
• Domains selected for the first stage:
  • shape (156)
  • substance (1181)
  • quantity (547)
  • food (885)
  • property (1492)

Extension via plWN. Pros and cons

• Pros:
  • There is a definite vocabulary basis for the extension
  • New synsets can be easily and safely located in the structure of the original PWN
• Cons:
  • Polish orientation of the extension
  • Addition of lexical units related to strictly Polish domains

Extension via corpus data. An alternative strategy

• This extension procedure uses frequency lists derived from:
  • British National Corpus
  • WaCky corpus
  • Corpus of Contemporary American English
  • American National Corpus
  • English Wikipedia
• Independent of plWordNet
• Criterion for inclusion of a new lexical unit
  • its appearance in five different texts
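The inclusion criterion can be illustrated with a simple count of how many different texts a word occurs in. This is a sketch under stated assumptions: 'five different texts' is read as 'at least five', tokens are split on whitespace without lemmatisation, and the function name and parameters are illustrative.

```python
from collections import Counter


def candidates_by_text_frequency(texts, known_lemmas, min_texts=5):
    """Return words missing from `known_lemmas` that occur in at least `min_texts` texts."""
    text_counts = Counter()
    for text in texts:
        # A set counts each word once per text, however often it repeats there.
        text_counts.update(set(text.lower().split()))
    return sorted(
        word
        for word, n in text_counts.items()
        if n >= min_texts and word not in known_lemmas
    )


# Toy data: "selfie" occurs in five texts, "vlog" in only four.
texts = ["selfie time"] * 5 + ["vlog"] * 4
new_units = candidates_by_text_frequency(texts, known_lemmas={"time"})
```

Counting texts rather than raw occurrences keeps a word that is repeated many times in a single document from qualifying on its own.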

Pros and cons

• Pros:
  • English oriented
  • no Polish bias
• Cons:
  • new synsets have to be introduced at different levels of the PWN hierarchy
  • there is a risk of changing the structure of the original PWN

Cross-lingual Applications

Cross-lingual:
• Semantic searching
• Semantic indexing of texts
• Text classification
• Statistical semantic analysis of corpora in different languages
• Information Extraction
• Machine Translation

Multi-lingual:
• Princeton WordNet 3.1 is linked to more than 60 languages

Conclusions

• The created bilingual resource will become a gateway to CLARIN bilingual resources

• It has a number of practical applications

• Princeton WordNet can be enriched and updated

• Extension of Princeton WordNet allows one to replace the existing inter-lingual hyponymy links between plWN and PWN synsets with more precise and useful inter-lingual synonymy links

References

Fellbaum, Ch. (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press: Cambridge, Massachusetts.

Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. Approaching plWordNet 2.0. In Proceedings of the 6th Global Wordnet Conference, Matsue.

Piasecki, M., Maziarz, M., Szpakowicz, S. and Rudnicka, E. 2014. plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. In Proceedings of the 7th International Global Wordnet Conference.

Princeton WordNet. http://wordnet.princeton.edu/wordnet/

Rudnicka, E., Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. A Strategy of Mapping Polish WordNet onto Princeton WordNet. In Proceedings of COLING 2012. ACL.

Słowosieć. http://plwordnet.pwr.wroc.pl/wordnet/

Vossen, P. (ed.). 2002. EuroWordNet. General Document. Amsterdam.