Extraction of Ontological Information from Lexicon and Corpora
1 Oslo, 14-16 Sep 2003
Extraction of Ontological Information from Lexicon and Corpora
Dimitrios Kokkinakis, Maria Toporowska Gronostaj
Motto
To process information you need information
P. Vossen, 2003
Content
• Introduction
– Background
– Language resources
– Methodology
• Lexicon-driven extraction of ontological data
• Corpus-driven extraction of ontological data
• Conclusions
Background
• What is ontological information?
The information needed to make common-sense-like inferences based on our knowledge of the world
• How is it represented?
As structured sets of conceptual types, often including the semantic relations that hold between them
• Where?
SIMPLE ontology, EWN (EuroWordNet), LexiQuest
Background
Why is ontological information relevant for NLP?
• supports the development of lexical resources aimed at text understanding, since it offers a means of disambiguation
• provides knowledge needed in
– machine translation (MT)
– information retrieval (IR)
– information extraction (IE)
– summarization
– computer aided language learning (CALL)
• enables communication on the Semantic Web
Background
What is meant by semi-automatic extraction of ontological information (OI)?
• some human intervention is involved in the information processing to maximize its effect
What will we achieve with it?
• enhance the content of the Swedish SIMPLE lexicon in a quick and cost-effective way
• investigate lexicon-driven and corpus-driven methodologies
Methodology in general (1)
Methodological assumptions:
• lexical databases, machine-readable dictionary (MRD) lexica and corpora can be mined for ontological information
• relevant factors in information processing:
– resource size
– degree of extractability
– implicitness and explicitness of information
– bootstrapping
Methodology in general (2)
Approach: text data mining (TDM)
TDM is a process of exploratory data analysis using text that leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known (Mitkov 2003, Hearst 2003)
Result: evolutionary lexicon model
Output data are reused to discover new data, which leads to a successive enlargement of the lexicon
Language resources SIMPLE-SE (1)
Corpora
• 150 million words in Språkbanken
Lexicon resources
• SIMPLE-SE lexicon
• GLDB (Göteborg lexical database)
• SEMNET
Language resources SIMPLE-SE (2)
About SIMPLE-SE: a computational lexicon with explicit ontological information (OI)
• 10 000 lexicon units
– 7 000 nouns, 2 000 verbs, 1 000 adjectives
• manually annotated with semantic and ontological information, which is linked to the morphosyntactic information in the PAROLE lexicon
• multidimensional
Language resources SIMPLE-SE (3)
SIMPLE-SE supports
• word sense disambiguation, e.g. kastanji (chestnut):
kastanji 1/1/0 FRUIT
kastanji 1/1/1 PLANT
kastanji 1/1/2sms COLOUR
kastanji 1/1/3 FOOD
kastanji 1/2/0 ORGANIC OBJECT
• finding regular polysemy
• creating multilingual links between lexicons
Language resources SIMPLE-SE (4)
SIMPLE-SE supports:
– text annotation
– text data mining & knowledge-based information processing
– evaluation
– pattern matching based on the ontological information assigned to arguments (selection restrictions/preferences)
Language resources SIMPLE-SE (5)
Selection-restriction-based pattern matching
Word/expression                 Position                Ontological term
injicera (inject)               object                  Substance
bebo (inhabit)                  object                  Area
griljera (roast)                object                  Food
förlova sig (become engaged)    subject, prep. object   Human
devalvera (devaluate)           object                  Money
ha ont i (have pain in)         prep. object            Body part
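The table above can be read as a lookup from a predicate and an argument position to the ontological type expected in that position. The following is a minimal Python sketch of that reading. The `RESTRICTIONS` dictionary, the position labels and the helper names `expected_type` and `satisfies` are illustrative assumptions, not the SIMPLE-SE data format.

```python
# Toy encoding of the selection-restriction table above.
# The dictionary layout and the function names are illustrative
# assumptions, not the SIMPLE-SE data format.
RESTRICTIONS = {
    ("injicera", "object"): "Substance",
    ("bebo", "object"): "Area",
    ("griljera", "object"): "Food",
    ("förlova sig", "subject"): "Human",
    ("förlova sig", "prep. object"): "Human",
    ("devalvera", "object"): "Money",
    ("ha ont i", "prep. object"): "Body part",
}


def expected_type(predicate, position):
    """Return the ontological term required at this position, or None."""
    return RESTRICTIONS.get((predicate, position))


def satisfies(predicate, position, argument_type):
    """True if the argument's ontological type meets the restriction."""
    required = expected_type(predicate, position)
    return required is not None and required == argument_type
```

A pattern matcher could call `satisfies("griljera", "object", "Food")` to accept a candidate argument, and treat a `None` result from `expected_type` as "no restriction recorded".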
Language resources GLDB
Göteborg lexical database (GLDB)
• 67 000 core senses with a stringent definition format
• implicit but extractable genus proximum (genus word)
• implicit ontological information about arguments in definition extensions
• 35 000 explicit semantic references covering semantic relations such as synonymy, antonymy, hyperonymy, hyponymy and cohyponymy
Language resources SEMNET (1)
SEMNET hyperonymic taxonomy
Extraction of hyperonymy relations from GLDB definitions
– (methodology and software: Y. Cederholm, 1999)
Recognition of headwords (genus proximum) in definitions
Language resources SEMNET (2)
Input data:
GLDB definitions
44 915 noun lexemes
10 082 verb lexemes
Two analysis methods that complement each other
Language resources SEMNET (3)
Method I
• distinguishing typical definition patterns for core senses (see the overhead/handout from Cederholm 1999, Table 1, "Definitionsformler" [definition formulas])
• pattern matching against non-lemmatized definitions (using regular expressions)
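Method I can be pictured as a short list of regular expressions, each capturing the genus word in one definition formula. The sketch below is only an illustration: the two patterns and the English toy definitions are invented for the example, whereas the actual formulas are those in Cederholm (1999), Table 1, and apply to Swedish GLDB definitions.

```python
import re

# Hypothetical definition formulas. The real ones are listed in
# Cederholm (1999), Table 1, and are not reproduced here.
PATTERNS = [
    # "a kind of X ..." / "a type of X ..."
    re.compile(r"^(?:a|an|the)\s+(?:kind|type|sort)\s+of\s+(\w+)"),
    # "a X that ..." / "an X which ..."
    re.compile(r"^(?:a|an|the)\s+(\w+)\s+(?:that|which|who)\b"),
]


def genus_by_pattern(definition):
    """Return the genus word captured by the first matching formula, else None."""
    text = definition.strip().lower()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group(1)
    return None


print(genus_by_pattern("A kind of tree with edible nuts"))  # tree
print(genus_by_pattern("to move quickly"))                  # None
```

Definitions that fit none of the formulas return `None`, which is the gap that a second, more permissive method has to fill.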
Language resources SEMNET (4)
Method II
Input: lemmatized definitions
Assumptions:
– the genus word is the first word in the definition that matches the part of speech of the headword (the word being defined)
– Method II also finds genus words that cannot be parsed with Method I
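The first assumption of Method II translates almost directly into code. The sketch below assumes that each definition is already lemmatized and tagged as a list of `(lemma, part_of_speech)` pairs. The tag names and the toy definition are invented for illustration and do not come from GLDB.

```python
def genus_by_pos(tagged_definition, headword_pos):
    """Return the first lemma whose part of speech equals that of the headword."""
    for lemma, pos in tagged_definition:
        if pos == headword_pos:
            return lemma
    return None


# Invented, already lemmatized and tagged definition of a noun headword.
toy_definition = [
    ("a", "DET"),
    ("chronic", "ADJ"),
    ("disease", "NOUN"),
    ("of", "PREP"),
    ("the", "DET"),
    ("airway", "NOUN"),
]

print(genus_by_pos(toy_definition, "NOUN"))  # disease
```

Because the heuristic stops at the first word with a matching part of speech, a definition of the form "a kind of X" would yield "kind" rather than X, which is one way such a method can gain coverage at the cost of correctness.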
Language resources SEMNET (5)
Analysis results for nouns
                 Total analyses    Correct analyses
Method I          8 127 (64%)       7 141 (56%)
Method II        12 194 (95%)       8 974 (70%)
Method I + II    12 528 (98%)      10 536 (83%)
(evaluation based on 12 786 manually annotated noun genus words)
Approximate result for about 45 000 nouns in genus position:
36 500 correctly recognised noun genus words
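The counts in the table above can be turned into rates with a few lines of arithmetic. The sketch below recomputes coverage and overall accuracy against the 12 786 evaluated words, and additionally derives precision among the analysed words, a figure that is not on the slide. The names `RESULTS`, `rates` and `SUMMARY` are illustrative.

```python
EVALUATED = 12786  # manually annotated noun genus words

# (analysed, correct) counts per method, copied from the table above
RESULTS = {
    "Method I": (8127, 7141),
    "Method II": (12194, 8974),
    "Method I + II": (12528, 10536),
}


def rates(analysed, correct, total=EVALUATED):
    """Return coverage, overall accuracy and precision as fractions."""
    return {
        "coverage": analysed / total,   # share of words given any analysis
        "accuracy": correct / total,    # share of words analysed correctly
        "precision": correct / analysed,  # share of analyses that are correct
    }


SUMMARY = {name: rates(*counts) for name, counts in RESULTS.items()}

for name, values in SUMMARY.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

The recomputed shares agree with the percentages in the table to within one point. On these counts Method I is correct for roughly 88% of the words it analyses, Method II for roughly 74%, and the combination for roughly 84%, so Method II adds coverage at some cost in precision.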
Language resources SEMNET (6)
The 33 most frequent noun genus words in SEMNET
2702 person 858 typ 612 del
461 anordning 314 område 261 kvinna
228 tillstånd 219 lära 217 titel
207 grupp 183 föremål 173 sammanfattning
172 mängd 169 sätt 167 plats
166 system 165 växt 162 ämne
153 apparat 145 förmåga 133 medlem
128 språk 122 stycke 122 redskap
122 plats 119 känsla 118 form
116 metod 116 handling 113 enhet
111 ljud 110 instrument 102 verksamhet
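A frequency list like the one above can be produced by counting how often each genus word was extracted. The sketch below assumes a hypothetical mapping from headword to extracted genus word; the five entries reuse words that appear elsewhere in these slides, and the resulting counts are toy values, not SEMNET figures.

```python
from collections import Counter


def rank_genus_words(genus_by_headword, top_n=None):
    """Count each genus word and return (word, count) pairs, most frequent first."""
    counts = Counter(genus_by_headword.values())
    return counts.most_common(top_n)


# Toy headword -> genus word mapping (illustrative only).
toy_genus = {
    "astma": "sjukdom",
    "blodkräfta": "sjukdom",
    "blödarsjuka": "sjukdom",
    "dollar": "myntenhet",
    "pund": "myntenhet",
}

print(rank_genus_words(toy_genus))
```

Passing `top_n=80` would select the kind of short list of frequent genus words that is worth annotating by hand first.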
Language resources SEMNET (7)
Hyperonymy taxonomy: sjukdom (disease)
sjukdom
-- [1] akutfall 1/1
-- [2] almsjuka 1/1
-- [3] astma 1/1
-- [4] avitaminos 1/1
-- [5] basedow 1/1
-- [6] bladrullsjuka 1/1
-- [7] blodkräfta 1/1
-- [8] blodsjukdom 1/1
-- [9] blödarsjuka 1/1
... (66 hyponyms in total)
Definition-driven extraction of ontological information (1)
Resources: SIMPLE-SE + SEMNET + GLDB
Methodological assumptions
The hyperonymic taxonomy, combined with the ontological information in SIMPLE-SE, supports semi-automatic extraction of ontological information
Procedure:
Preparatory phase, relevant for all ontological processing: annotate GLDB data with the ontological information from SIMPLE-SE to generate an ontologically enriched SEMNET
Definition-driven extraction of ontological information (2)
Methodological assumptions (cont.)
The extracted ontological information is an approximation of the ontological category until it is verified with other methods, e.g. a corpus-driven methodology, semantic/ontological data from GLDB, or pattern matching based on selection restrictions
Since the annotated words in SIMPLE cover both hyperonyms and hyponyms, two methods are proposed here, each focusing on one of these semantic categories
Definition-driven extraction of ontological information (3)
Method I: from annotated hyponyms to new annotations of hyperonyms
Assumption: one can approximate the ontological category of a hyperonym given some information on its hyponyms and using the structural knowledge inherent in the ontology
A hyperonym can be annotated if all of its annotated hyponyms share the same ontological tag, or if their tags share a common superordinate tag. The exception is the tag Entity, which is ontologically heterogeneous and thus relatively uninformative
Definition-driven extraction of ontological information (4)
Method I example
Hyponyms (known info)
diabetes [Disease]      cat [Air_animal]
asthma [Disease]        dog [Air_animal]
cholera [Disease]       fish [Water_animal]
Hyperonym (new info)
disease => [Disease]    animal => [Animal]
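The rule behind this example (same tag, or a common superordinate tag other than Entity) can be sketched as a search for the most specific shared ancestor. The `PARENT` map below is a toy ontology fragment assumed for illustration, and the real SIMPLE ontology is larger and may be structured differently. The function names are hypothetical.

```python
# Toy fragment of an ontology: tag -> parent tag. This layout is an
# assumption for the example, not the actual SIMPLE ontology.
PARENT = {
    "Air_animal": "Animal",
    "Water_animal": "Animal",
    "Animal": "Entity",
    "Disease": "Entity",
}


def ancestors(tag):
    """Return the tag followed by all of its superordinate tags, nearest first."""
    chain = [tag]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def hyperonym_tag(hyponym_tags):
    """Return the most specific tag shared by all hyponyms.

    Returns None when there are no hyponym tags or when the only shared
    tag is the uninformative top node Entity.
    """
    if not hyponym_tags:
        return None
    first, *rest = hyponym_tags
    for candidate in ancestors(first):
        if all(candidate in ancestors(tag) for tag in rest):
            return None if candidate == "Entity" else candidate
    return None


print(hyperonym_tag(["Disease", "Disease", "Disease"]))             # Disease
print(hyperonym_tag(["Air_animal", "Air_animal", "Water_animal"]))  # Animal
print(hyperonym_tag(["Disease", "Air_animal"]))                     # None
```

The result is still an approximation: it only says which tag is consistent with the hyponyms annotated so far.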
Definition-driven extraction of ontological information (5)
Method II: from annotated hyperonyms to new annotations of hyponyms
Assumption (resulting in approximation)
Direct hyponyms (hyponyms directly subordinated to the genus word/hyperonym) automatically inherit the ontological category of their hyperonym. Manual annotation of the most frequent genus words/hyperonyms can therefore be recommended and justified.
Hyperonym (known info)               Hyponyms (new info)
myntenhet (monetary unit) [Money] => dollar, krona, pund, rubel... [Money]
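One level of this inheritance can be sketched as copying a hyperonym's tag to each direct hyponym that has no tag yet. The data layout (a dictionary from hyperonym to its direct hyponyms, and a dictionary of known tags) and the function name `inherit_direct` are assumptions made for the example.

```python
def inherit_direct(hyponyms_of, tags):
    """Copy each annotated hyperonym's tag to its untagged direct hyponyms.

    Returns only the new, approximate annotations; existing tags are kept.
    """
    new_tags = {}
    for hyperonym, hyponyms in hyponyms_of.items():
        tag = tags.get(hyperonym)
        if tag is None:
            continue  # nothing to inherit from an unannotated hyperonym
        for word in hyponyms:
            if word not in tags:
                new_tags[word] = tag
    return new_tags


hyponyms_of = {"myntenhet": ["dollar", "krona", "pund", "rubel"]}
known_tags = {"myntenhet": "Money"}

print(inherit_direct(hyponyms_of, known_tags))
```

Returning the new annotations separately, rather than merging them into the known tags, keeps manually verified and automatically inherited tags apart.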
Definition-driven extraction of ontological information (6)
This assumption has far-reaching consequences for all annotated hyponyms that also occur as genus words, since their subordinates can automatically inherit the ontological class from the hyperonym/genus word.
Cascade effect:
sjukdom (disease) 66 hyponyms
+ infektionssjukdom (infectious disease) 25 hyponyms
+ könssjukdom (venereal disease) 4 hyponyms
Definition-driven extraction of ontological information (7)
Cascade distribution of the ontological type [Animal]
djur (animal) 102 hyponyms
+ hovdjur (hoofed animal) 10
+ ryggradsdjur (vertebrate) 8
+ fågel (bird) 98
+ däggdjur (mammal) 18
Note: the 80 most frequent genus words, when ontologically annotated, give rise to 11 000 automatically annotated genus words at the first hyponymy level. This number increases further due to the cascade effect.
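The cascade effect amounts to a breadth-first walk down the taxonomy from one annotated genus word, recording the level at which each word is reached. The sketch below assumes a toy taxonomy: the words come from these slides, but the nesting shown is assumed for illustration and is not taken from SEMNET.

```python
from collections import deque


def cascade(hyponyms_of, root, tag):
    """Propagate `tag` from `root` down every hyponymy level.

    Returns {word: (tag, level)}, where level 1 marks direct hyponyms.
    """
    annotated = {}
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        word, level = queue.popleft()
        for hyponym in hyponyms_of.get(word, []):
            if hyponym in seen:
                continue  # guard against cycles and repeated entries
            seen.add(hyponym)
            annotated[hyponym] = (tag, level + 1)
            queue.append((hyponym, level + 1))
    return annotated


# Toy taxonomy; the nesting is an assumption made for the example.
toy_taxonomy = {
    "sjukdom": ["astma", "infektionssjukdom"],
    "infektionssjukdom": ["könssjukdom"],
}

print(cascade(toy_taxonomy, "sjukdom", "Disease"))
```

Keeping the level next to each tag makes it possible to treat annotations from deeper levels as less reliable, since an error in a genus word propagates to everything below it.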
Definition-driven extraction of ontological information (8)
Freq.  Genus word  Sense      English gloss  Ontological type
2702   person      1/1        person         HUMAN
 461   anordning   1/*1       device         ARTIFACT
 314   område      1/1        area           AREA (>LOCATION)
 261   kvinna      1/1        woman          HUMAN
 238   tillstånd   1/*        state          STATE
 219   lära        1/*1       doctrine       DOMAIN
 217   titel       1/*1       title          SOCIAL_STATUS (>HUMAN)
 183   föremål     1/*        thing          CONCRETE_ENTITY
 169   sätt        1/*1       manner         CONSTITUTIVE
 167   plats       1/*1 or 4  place          LOCATION
 166   system      1/*1       system         CONSTITUTIVE
 165   växt        1/*1       plant          PLANT
Conclusion
• Ontological annotations are approximations. They need to be verified against manually annotated data and/or by means of a corpus-driven methodology for extracting ontological information
• The status of ontological annotations needs to be explicitly specified in the database
• Method I (from hyponyms to hyperonyms) seems to complement Method II (from hyperonyms to hyponyms), since the range of annotated categories increases rapidly
• The quality (and quantity) of the lexical resources used determines the precision of the acquired results – ontology
Conclusion (cont'd)
To prevent overgeneration of incorrect ontological annotations, special attention needs to be paid to:
• disambiguation of polysemous and homographic genus words (hyperonyms)
krona [Artifact], [Money], [Part]
• analysis of compound nouns
gosedjur (cuddly toy) [Artifact] vs husdjur (domestic animal) [Animal]
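One simple safeguard for the first point is to let a genus word pass its tag on only when it has exactly one candidate tag, and to route everything else to manual disambiguation. The sketch below assumes a hypothetical mapping from genus word to a set of candidate tags; the function name `safe_genus_tag` is illustrative.

```python
def safe_genus_tag(genus_word, tags_by_word):
    """Return the genus word's tag only if it is unambiguous.

    Genus words with several candidate tags, or with none, return None
    so that they can be sent to manual disambiguation instead.
    """
    candidates = tags_by_word.get(genus_word, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


# Candidate tags per genus word (krona as on the slide above).
tags_by_word = {
    "krona": {"Artifact", "Money", "Part"},
    "myntenhet": {"Money"},
}

print(safe_genus_tag("myntenhet", tags_by_word))  # Money
print(safe_genus_tag("krona", tags_by_word))      # None
```

This check does not address compounds: a word such as gosedjur would still need its own analysis, because its head (djur) alone suggests the wrong tag.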
References
Cederholm, Y. 1999. Automatisk konstruktion av en hyperonymitaxonomi baserad på definitioner i GLDB. In Från dataskärm och forskarpärm. MISS 25. Göteborgs universitet.
Hearst, M. 2003. Text Data Mining. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
Mitkov, R. (ed.) 2003. The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
Vossen, P. 2003. Ontologies. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
For more about SIMPLE, see http://spraakbanken.gu.se