Ewa Rudnicka, Wojciech Witkowski, Maciej Piasecki
G4.19 Research Group
Institute of Informatics,
Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl
Large Polish-English Lexico-Semantic Resource Based on plWordNet - Princeton WordNet Mapping
Outline
• What is a wordnet?
• Mapping plWordNet on Princeton WordNet
• Extending Princeton WordNet
• Applications
• Conclusions
What is a wordnet? (1)
A huge electronic lexico-semantic database (a kind of thesaurus)
Basic building blocks:
- lemma – base form representing different inflectional forms and different meanings
  e.g. czwórka – 'good'
- lexical unit – a lemma-sense pair (in wordnets the sense is marked with a number)
  e.g. czwórka 3 (por – 'communication')
- synset – a set of synonymous lexical units
  e.g. {czwórka 3 (por), czwóra 1 (por)}
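The three building blocks can be pictured as a small data model. The sketch below is only an illustration: the class and field names (`LexicalUnit`, `Synset`, `domain`) are assumptions made here, not the actual plWordNet schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma paired with a sense number (names are illustrative)."""
    lemma: str        # base form, e.g. "czwórka"
    sense: int        # sense number within the wordnet
    domain: str = ""  # optional domain label, e.g. "por"


@dataclass(frozen=True)
class Synset:
    """A set of synonymous lexical units."""
    units: frozenset

    def lemmas(self):
        # Sorted list of the lemmas that occur in this synset.
        return sorted(unit.lemma for unit in self.units)


czworka_3 = LexicalUnit("czwórka", 3, "por")
czwora_1 = LexicalUnit("czwóra", 1, "por")
grade_synset = Synset(frozenset({czworka_3, czwora_1}))
```

Making both classes frozen keeps them hashable, so lexical units can be members of a synset and synsets can later serve as nodes in a relation graph.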
What is a wordnet? (2)
Both lexical units and synsets are linked via different lexico-semantic relations, such as:
synonymy, near-synonymy,
hypernymy/hyponymy,
meronymy/holonymy, fuzzynymy
Examples:
Lexical relations:
- czwórka 3 (por) has a derivativity relation to czwórka 4 (por)
- czwórka 3 (por) has an expressiveness relation to czwóra 1 (por)
Synset relations:
- {czwórka 3 (por), czwóra 1 (por)} is a hyponym of {stopień 3 (il), ocena 1 (il), nota 3 (il)}
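The two levels of relations in the examples above can be sketched as two labelled graphs, one over lexical units and one over synsets. The class `RelationGraph` and the relation labels (`"hyponym_of"` and so on) are hypothetical names chosen for this illustration, and lexical units are written as plain strings to keep it short.

```python
from collections import defaultdict


class RelationGraph:
    """Directed graph whose edges carry a relation name (illustrative)."""

    def __init__(self):
        self._edges = defaultdict(set)  # (source, relation) -> set of targets

    def add(self, source, relation, target):
        self._edges[(source, relation)].add(target)

    def targets(self, source, relation):
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self._edges.get((source, relation), set()))


# Relations between lexical units.
lexical = RelationGraph()
lexical.add("czwórka 3", "derivativity", "czwórka 4")
lexical.add("czwórka 3", "expressiveness", "czwóra 1")

# Relations between synsets (frozensets, so they can be used as dict keys).
grade = frozenset({"czwórka 3", "czwóra 1"})
mark = frozenset({"stopień 3", "ocena 1", "nota 3"})
synsets = RelationGraph()
synsets.add(grade, "hyponym_of", mark)
synsets.add(mark, "hypernym_of", grade)  # inverse direction, stored explicitly
```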
Princeton WordNet
Princeton WordNet (Fellbaum 1998):
• the first wordnet ever built on psycholinguistic principles – mapping the structure of human lexical memory (cf. Miller 1998)
• taxonomic hierarchies for nouns, entailment relations for verbs, antonym relations for adjectives
• synsets represent 'lexicalised concepts' (cf. Miller 1998)
• synsets built of lexical units linked by the synonymy relation, understood as a conceptual relation established on the basis of linguists' intuitions and dictionary definitions
• no major changes since 2006, last version 2012
plWordNet - Słowosieć
• plWordNet (plWN)
  - developed fairly independently of Princeton WordNet (PWN) by applying a unique corpus-based method
  - one of the biggest existing wordnets
Number of        plWN      PWN       enWN
lemmas           156,402   155,593   157,541
lexical units    220,129   206,978   209,147
synsets          162,629   117,659   119,290
• the emphasis on relations between lexical units; new relations specially designed to cover the peculiarities of the morphosyntactic structure of Polish
  - cf. Piasecki et al. 2009, Maziarz et al. 2012
• synsets built of lexical units sharing the same set of constitutive relations, such as hyponymy, hypernymy, meronymy, holonymy
• partly linked to Princeton WordNet
  - cf. Rudnicka et al. 2012
plWordNet vs. Princeton WordNet
Mapping plWordNet on Princeton WordNet
• Goal: linking plWordNet synsets with Princeton WordNet synsets
• Steps:
  - Defining a set of inter-lingual relations and setting their hierarchy
  - Designing mapping procedures for nouns and adjectives
• Mapping direction: plWordNet > Princeton WordNet
• Bottom-up approach – starting from the lowest levels in the hierarchy
• Currently mapped lexical categories:
  - nouns (most of them), adjectives (about a half)
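One way to read the bottom-up approach is as a processing order in which the most specific synsets are handled before their hypernyms. The sketch below assumes a simplified hierarchy where every synset has at most one hypernym; the function name `bottom_up_order` and the toy identifiers are invented for the illustration and do not describe the project's actual tooling.

```python
def bottom_up_order(hypernym_of):
    """Order synset ids so that deeper (more specific) synsets come first.

    hypernym_of maps each synset id to its hypernym id, or to None for a root.
    """
    def depth(synset_id):
        steps = 0
        while hypernym_of.get(synset_id) is not None:
            synset_id = hypernym_of[synset_id]
            steps += 1
        return steps

    # Deepest first; ties are broken alphabetically to keep the order stable.
    return sorted(hypernym_of, key=lambda s: (-depth(s), s))


# Toy hierarchy, not real wordnet content.
toy_hierarchy = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
}
order = bottom_up_order(toy_hierarchy)
```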
Automatic prompts
Two systems, based on:
1) relaxation labeling algorithm (nouns)
2) rules relying on the network of the existing intra- and inter-lingual relations (adjectives)
Resource: cascade dictionary
Generated prompts:
- visible in the form of special links in the WordNetLoom editing system
- verified by lexicographers
A set of inter-lingual relations and current statistics
• A set of inter-lingual relations between plWN and PWN
• inspired by:
  - inter-lingual relations from EuroWordNet (Vossen 2002)
  - intra-lingual relations from plWordNet (Maziarz et al. 2011)
• Statistics of the established inter-lingual links (nouns / adjectives):
1. Synonymy: 28 736 / 3 199
2. Partial synonymy: 2 580 / 1 003
3. Inter-register synonymy: 1 510 / 35
4. Hyponymy: 57 029 / 6 561
5. Hypernymy: 3 744 / 34
6. Meronymy: 6 034
7. Holonymy: 1 204
8. Cross-categorial synonymy: 3 891
Motivation for the extension of Princeton WordNet
The high percentage of inter-lingual hyponymy links between plWordNet and Princeton WordNet synsets
These links were established because of a number of lexical coverage gaps in Princeton WordNet and the resulting impossibility of establishing much more informative and useful inter-lingual synonymy links
They can be used as ‘pointers’ to specific Princeton WordNet gaps (‘missing’ lexical units) and to whole ‘empty nests’ (several missing co-hyponyms of one hypernym synset) in the network
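The imbalance can be made concrete with the link statistics reported earlier in the talk: 28 736 synonymy and 57 029 hyponymy links for nouns, and 3 199 synonymy and 6 561 hyponymy links for adjectives. The short calculation below only restates those figures as a ratio; the variable and function names are chosen here for illustration.

```python
# Inter-lingual link counts taken from the statistics slide.
links = {
    "nouns": {"synonymy": 28736, "hyponymy": 57029},
    "adjectives": {"synonymy": 3199, "hyponymy": 6561},
}


def hyponymy_per_synonymy(counts):
    """Number of inter-lingual hyponymy links per synonymy link."""
    return counts["hyponymy"] / counts["synonymy"]


ratios = {pos: round(hyponymy_per_synonymy(c), 2) for pos, c in links.items()}
# Roughly two hyponymy links for every synonymy link in both categories.
```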
Inter-lingual hyponymy links
General extension procedure
• The starting point: existing inter-lingual hyponymy links
• Lemmas of plWordNet synsets are translated by a cascade dictionary
  - which combines several traditional dictionaries, with the data ordered in a hierarchy of importance; the topmost sources take priority
• The results are filtered by the lemmas of Princeton WordNet to obtain:
  - a list of plWN lemmas with ‘equivalent’ cascade dictionary lemmas absent from PWN
  - a list of plWN lemmas without ‘equivalent’ cascade dictionary lemmas
  - a list of plWN lemmas with ‘equivalent’ cascade dictionary lemmas present in PWN
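The translation and filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: each dictionary is modelled as a plain mapping from a Polish lemma to English equivalents, the cascade is assumed to merge equivalents in priority order, and a lemma counts as "present in PWN" when at least one of its equivalents is a PWN lemma. The function names and the toy data are hypothetical and do not reflect the real dictionaries or PWN contents.

```python
def cascade_translate(lemma, dictionaries):
    """Merge equivalents from all dictionaries, highest priority first."""
    seen, ordered = set(), []
    for dictionary in dictionaries:  # ordered from most to least important
        for equivalent in dictionary.get(lemma, []):
            if equivalent not in seen:
                seen.add(equivalent)
                ordered.append(equivalent)
    return ordered


def partition_lemmas(plwn_lemmas, dictionaries, pwn_lemmas):
    """Split plWN lemmas into the three lists described above."""
    absent_from_pwn, untranslated, present_in_pwn = [], [], []
    for lemma in plwn_lemmas:
        equivalents = cascade_translate(lemma, dictionaries)
        if not equivalents:
            untranslated.append(lemma)
        elif any(eq in pwn_lemmas for eq in equivalents):
            present_in_pwn.append(lemma)
        else:
            absent_from_pwn.append(lemma)
    return absent_from_pwn, untranslated, present_in_pwn


# Toy data only.
dictionaries = [
    {"pies": ["dog"]},
    {"pies": ["hound"], "oscypek": ["oscypek cheese"]},
]
toy_pwn_lemmas = {"dog", "cheese"}
absent, untranslated, present = partition_lemmas(
    ["pies", "oscypek", "kogel-mogel"], dictionaries, toy_pwn_lemmas
)
```

In this reading, the first list holds the most direct candidates for new PWN lexical units, since an English equivalent exists but PWN does not contain it.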
Extension procedure
• Start at the lowest level of the hierarchy
  - in order not to change the structure of the original Princeton WordNet
• Verification of the suggested English equivalent(s)
  - in corpora and other reliable sources
  - on the basis of
    - the researcher’s knowledge
    - dictionaries
    - frequency lists from corpora
• Creation of the new Princeton WordNet synset
• The synset is linked
  - via an intra-lingual hyponymy relation to a proper PWN hypernym synset
  - via an inter-lingual synonymy relation to its direct counterpart in plWordNet
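The last two steps, creating the synset and linking it, can be sketched as a single operation on two mappings. Everything here is an assumption for illustration: `pwn_hypernym` maps a PWN synset id to its hypernym id, `interlingual` maps a plWN synset id to a (relation, PWN synset id) pair, and the identifiers such as `en:oscypek` are invented toy values.

```python
def add_leaf_synset(pwn_hypernym, interlingual, new_id, hypernym_id, plwn_id):
    """Add new_id as a leaf under hypernym_id and link it to plwn_id."""
    if hypernym_id not in pwn_hypernym:
        raise KeyError(f"unknown PWN hypernym synset: {hypernym_id}")
    if new_id in pwn_hypernym:
        raise ValueError(f"synset already exists: {new_id}")
    # Intra-lingual hyponymy: the new synset hangs below an existing one,
    # so no existing hypernym link is modified.
    pwn_hypernym[new_id] = hypernym_id
    # Inter-lingual synonymy to the direct plWordNet counterpart; this
    # overwrites the earlier inter-lingual hyponymy link, if there was one.
    interlingual[plwn_id] = ("synonymy", new_id)


# Toy state before the extension.
pwn_hypernym = {"en:food": None, "en:cheese": "en:food"}
interlingual = {"pl:oscypek": ("hyponymy", "en:cheese")}

add_leaf_synset(pwn_hypernym, interlingual, "en:oscypek", "en:cheese", "pl:oscypek")
```

Because the new synset is only ever attached as a leaf, the existing entries of `pwn_hypernym` stay untouched, which mirrors the aim of not changing the original PWN structure.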
Extension results
• Each added synset is provided with:
  - a definition
    - major source: English Wikipedia
  - a usage example
    - from a corpus or another reliable English source
• Total number of selected plWN synsets: 42785
• Domains selected for the first stage:
  - shape (156)
  - substance (1181)
  - quantity (547)
  - food (885)
  - property (1492)
Extension via plWN. Pros and cons
• Pros:
  - There is a definite vocabulary basis for the extension
  - New synsets can be easily and safely located in the structure of the original PWN
• Cons:
  - Polish orientation of the extension
  - Addition of lexical units related to strictly Polish domains
Extension via corpora data. An alternative strategy
• This extension procedure uses frequency lists derived from:
  - British National Corpus
  - WaCky corpus
  - Corpus of Contemporary American English
  - American National Corpus
  - English Wikipedia
• Independent of plWordNet
• Criterion for inclusion of a new lexical unit
  - its appearance in five different texts
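The inclusion criterion amounts to a document-frequency threshold. The sketch below assumes that each text is already a list of lemmas and reads "five different texts" as "at least five"; the function name `inclusion_candidates` and the toy corpus are illustrative only.

```python
from collections import Counter


def inclusion_candidates(texts, known_lemmas, min_texts=5):
    """Lemmas missing from the wordnet that occur in at least min_texts texts."""
    text_frequency = Counter()
    for lemmas in texts:
        text_frequency.update(set(lemmas))  # count each lemma once per text
    return sorted(
        lemma
        for lemma, count in text_frequency.items()
        if count >= min_texts and lemma not in known_lemmas
    )


# Toy corpus: "selfie" occurs in five texts, "blog" in four, "dog" is known.
toy_texts = [["selfie", "dog", "selfie"]] * 5 + [["blog"]] * 4
candidates = inclusion_candidates(toy_texts, known_lemmas={"dog"})
```

Counting texts rather than raw occurrences prevents a single long document from pushing a rare word over the threshold.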
Pros and cons
• Pros:
  - English oriented
  - no Polish bias
• Cons:
  - new synsets have to be introduced at different levels of the PWN hierarchy
  - there is a risk of changing the structure of the original PWN
Cross-lingual Applications
Cross-lingual:
- Semantic searching
- Semantic indexing of texts
- Text classification
- Statistical semantic analysis of corpora in different languages
- Information Extraction
- Machine Translation
Multi-lingual:
- Princeton WordNet 3.1 is linked to more than 60 languages
Conclusions
• The created bilingual resource will become a gateway to CLARIN bilingual resources
• It has a number of practical applications
• Princeton WordNet can be enriched and updated
• Extension of Princeton WordNet allows one to replace the existing inter-lingual hyponymy links between plWN and PWN synsets with more precise and useful inter-lingual synonymy links
References
Fellbaum, Ch. (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press: Cambridge, Massachusetts.
Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. Approaching plWordNet 2.0. Proceedings of the 6th Global Wordnet Conference, Matsue.
Piasecki, M., Maziarz, M., Szpakowicz, S. and Rudnicka, E. 2014. plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. In Proceedings of the 7th International Global Wordnet Conference.
Princeton WordNet. http://wordnet.princeton.edu/wordnet/
Rudnicka, E., Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. A Strategy of Mapping Polish WordNet onto Princeton WordNet. In Proceedings of COLING 2012. ACL.
Słowosieć. http://plwordnet.pwr.wroc.pl/wordnet/
Vossen, P. (ed.). 2002. EuroWordNet. General Document. Amsterdam.