extraction of ontological information from lexicon and corpora

1 Oslo, 14-16 Sep 2003 Extraction of Ontological Information from Lexicon and Corpora Dimitrios Kokkinakis Maria Toporowska Gronostaj



1 Oslo, 14-16 Sep 2003

Extraction of Ontological Information from Lexicon and Corpora

Dimitrios Kokkinakis, Maria Toporowska Gronostaj


Motto

To process information you need information

P. Vossen, 2003


Content

• Introduction
– Background
– Language resources
– Methodology

• Lexicon-driven extraction of ontological data

• Corpus-driven extraction of ontological data

• Conclusions


Background

• What is ontological information?

Information needed to make common-sense inferences based on our knowledge of the world.

• How is it represented?

As structured sets of conceptual types, often including the semantic relations that hold between them.

• Where?

In the SIMPLE ontology, EWN (EuroWordNet), LexiQuest.


Background

Why is ontological information relevant for NLP?

• promotes the development of lexical resources aimed at text understanding, since it offers a means of disambiguation

• provides knowledge needed in

– machine translation (MT)

– information retrieval (IR)

– information extraction (IE)

– summarization

– computer aided language learning (CALL)

• enables communication on the Semantic Web


Background

What is meant by semi-automatic extraction of ontological information (OI)?

• some human intervention is involved in the information processing to maximize its effect

What will we achieve with it?

• enhance the content of the Swedish SIMPLE lexicon in a quick and cost-effective way

• investigate lexicon-driven and corpus-driven methodologies


Methodology in general (1)

Methodological assumptions:

• lexical databases, MRD lexica and corpora can be mined for ontological information

• relevant factors in information processing:

– resource size

– degree of extractability

– implicitness and explicitness of information

– bootstrapping


Methodology in general (2)

Approach: text data mining (TDM)

TDM is a process of exploratory data analysis using text that leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known (Mitkov 2003, Hearst 2003)

Result: an evolutionary lexicon model, in which output data are reused to discover new data, leading to a successive enlargement of the lexicon


Language resources SIMPLE-SE (1)

Corpora

150 million words in Språkbanken

Lexicon resources

SIMPLE-SE lexicon

GLDB (Göteborg Lexical Database)

SEMNET


Language resources SIMPLE-SE (2)

About SIMPLE-SE

– computational lexicon with explicit ontological information (OI)

• 10 000 lexicon units

– 7 000 nouns, 2 000 verbs, 1 000 adjectives

• manually annotated with semantic and ontological information, which is linked to the morphosyntactic information in the PAROLE lexicon

• multidimensional


Language resources SIMPLE-SE (3)

SIMPLE-SE supports

• word sense disambiguation, e.g. kastanji (chestnut):

kastanji 1/1/0 FRUIT

kastanji 1/1/1 PLANT

kastanji 1/1/2sms COLOUR

kastanji 1/1/3 FOOD

kastanji 1/2/0 ORGANIC OBJECT

• finding regular polysemy

• creating multilingual links between lexicons


Language resources SIMPLE-SE (4)

SIMPLE-SE supports:

– text annotation

– text data mining & knowledge-based information processing

– evaluation

– pattern matching based on the ontological information assigned to arguments (selection restrictions/preferences)


Language resources SIMPLE-SE (5)

Selection-restriction-based pattern matching

Word/expression                Position            Ontological term
injicera (inject)              object              Substance
bebo (inhabit)                 object              Area
griljera (roast)               object              Food
förlova sig (become engaged)   subj., prep. obj.   Human
devalvera (devaluate)          obj.                Money
ha ont i (have pain in)        prep. obj.          Body part
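
Restrictions like those in the table above could be used to propose ontological types for words found in argument positions. The following is only a minimal sketch in Python: the function name `propose_types`, the normalised position labels and the example triples are assumptions made for illustration, and a parser is assumed to supply (predicate, position, word) triples.

```python
# A subset of the selection restrictions from the table above,
# keyed by (predicate, argument position). Position labels are normalised here.
RESTRICTIONS = {
    ("injicera", "object"): "Substance",
    ("bebo", "object"): "Area",
    ("griljera", "object"): "Food",
    ("devalvera", "object"): "Money",
    ("ha ont i", "prep. object"): "Body part",
}

def propose_types(triples):
    """Map each argument word to the set of types its predicates select for."""
    proposals = {}
    for predicate, position, word in triples:
        onto_type = RESTRICTIONS.get((predicate, position))
        if onto_type is not None:
            proposals.setdefault(word, set()).add(onto_type)
    return proposals

# Hypothetical triples (assumption: produced by a syntactic parser).
triples = [
    ("injicera", "object", "insulin"),
    ("devalvera", "object", "krona"),
    ("griljera", "object", "skinka"),
    ("läsa", "object", "bok"),  # no restriction known, so no proposal
]
print(propose_types(triples))
```

The proposals remain candidates; a word such as krona can carry several ontological types, so each proposal would still need verification.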


Language resources GLDB

Göteborg Lexical Database (GLDB)

• 67 000 core senses with a stringent definition format

• implicit but extractable genus proximum (genus word)

• implicit ontological information about arguments in definition extensions

• 35 000 explicit semantic references for semantic relations such as synonymy, antonymy, hyperonymy, hyponymy and cohyponymy


Language resources SEMNET (1)

SEMNET: a hyperonymic taxonomy

• extraction of hyperonymy relations from GLDB definitions (methodology & software: Y. Cederholm, 1999)

• recognition of headwords (genus proximum) in definitions


Language resources SEMNET (2)

Input data: GLDB definitions

– 44 915 noun lexemes

– 10 082 verb lexemes

Two analysis methods that complement each other


Language resources SEMNET (3)

Method I

• identifying typical definition patterns for core senses (see overhead/handout from Cederholm 1999, Table 1, "Definitionsformler" (definition formulas))

• pattern matching against non-lemmatized definitions (using regular expressions)
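
Pattern matching of this kind can be illustrated with a few regular expressions. This is a sketch only: the two patterns and the example definitions below are hypothetical and are not the definition formulas listed in Cederholm (1999).

```python
import re

# Hypothetical definition patterns (assumption); each captures the genus word.
PATTERNS = [
    re.compile(r"^(?:en|ett)\s+typ\s+av\s+(\w+)"),  # "en typ av X ..."
    re.compile(r"^(?:en|ett)\s+(\w+)\s+som\b"),     # "en X som ..."
]

def genus_by_pattern(definition):
    """Return the genus word captured by the first matching pattern, else None."""
    text = definition.lower().strip()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group(1)
    return None

print(genus_by_pattern("en sjukdom som drabbar luftvägarna"))  # sjukdom
print(genus_by_pattern("En typ av fågel med lång näbb"))       # fågel
print(genus_by_pattern("kronisk sjukdom i luftvägarna"))       # None
```

The last definition matches no pattern, which is the kind of case a complementary method has to cover.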


Language resources SEMNET (4)

Method II

Input: lemmatized definitions

Assumptions:

– the genus word is the first word in the definition that matches the part of speech of the headword (the word being defined)

– Method II also finds genus words that cannot be parsed with Method I
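
The first assumption translates almost directly into code. A minimal sketch, assuming the lemmatized definition is available as a list of (lemma, part-of-speech) pairs; the tag names and the example definition are hypothetical.

```python
def genus_by_pos(headword_pos, lemmatized_definition):
    """Return the first lemma whose part of speech equals that of the headword."""
    for lemma, pos in lemmatized_definition:
        if pos == headword_pos:
            return lemma
    return None

# Hypothetical lemmatized definition of a noun headword (tags are assumptions).
definition = [
    ("kronisk", "ADJ"),
    ("sjukdom", "NOUN"),
    ("i", "PREP"),
    ("luftväg", "NOUN"),
]
print(genus_by_pos("NOUN", definition))  # sjukdom
```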


Language resources SEMNET (5)

Analysis results for nouns

                Total analyses     Correct analyses
Method I        8 127 (64%)        7 141 (56%)
Method II       12 194 (95%)       8 974 (70%)
Method I + II   12 528 (98%)       10 536 (83%)

(evaluation based on 12 786 manually annotated noun genus words)

Approximate result for ca. 45 000 nouns in genus position:

36 500 correctly recognised noun genus words
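
Both columns are proportions of the manually annotated set. A minimal sketch of how such figures could be computed, assuming the predictions and the gold annotations are dictionaries from lexeme to genus word; the function name `evaluate` and the toy data are hypothetical and are not the evaluation data.

```python
def evaluate(predicted, gold):
    """Return (share analysed, share correct), both relative to the gold set."""
    total = len(gold)
    analysed = sum(1 for word in gold if predicted.get(word) is not None)
    correct = sum(1 for word in gold if predicted.get(word) == gold[word])
    return analysed / total, correct / total

# Toy data (assumption): four gold entries, three analysed, two of them correct.
gold = {"astma": "sjukdom", "dollar": "myntenhet", "örn": "fågel", "hammare": "redskap"}
predicted = {"astma": "sjukdom", "dollar": "myntenhet", "örn": "djur"}
print(evaluate(predicted, gold))  # (0.75, 0.5)
```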


Language resources SEMNET (6)

The 33 most frequent noun genus words in SEMNET

2702 person       858 typ        612 del
461 anordning     314 område     261 kvinna
228 tillstånd     219 lära       217 titel
207 grupp         183 föremål    173 sammanfattning
172 mängd         169 sätt       167 plats
166 system        165 växt       162 ämne
153 apparat       145 förmåga    133 medlem
128 språk         122 stycke     122 redskap
122 plats         119 känsla     118 form
116 metod         116 handling   113 enhet
111 ljud          110 instrument 102 verksamhet
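
A frequency list of this kind could be obtained by counting genus words over the extracted (lexeme, genus word) pairs. A minimal sketch, assuming hypothetical pairs; the counts it produces are illustrative and are not SEMNET figures.

```python
from collections import Counter

# Hypothetical (lexeme, genus word) pairs as produced by genus extraction.
pairs = [
    ("astma", "sjukdom"),
    ("almsjuka", "sjukdom"),
    ("dollar", "myntenhet"),
    ("krona", "myntenhet"),
    ("pund", "myntenhet"),
    ("lärare", "person"),
]

genus_counts = Counter(genus for _, genus in pairs)
for genus, count in genus_counts.most_common(2):
    print(count, genus)
```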


Language resources SEMNET (7)

Hyperonymy taxonomy: sjukdom (disease)

sjukdom

-- [1] akutfall 1/1

-- [2] almsjuka 1/1

-- [3] astma 1/1

-- [4] avitaminos 1/1

-- [5] basedow 1/1

-- [6] bladrullsjuka 1/1

-- [7] blodkräfta 1/1

-- [8] blodsjukdom 1/1

-- [9] blödarsjuka 1/1

... (66 hyponyms in total)
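
A taxonomy listing like this can be built by grouping lexemes under their genus word. A minimal sketch, assuming hypothetical (hyponym, genus word) pairs; the function names are illustrative and sense numbers are left out.

```python
from collections import defaultdict

def build_taxonomy(pairs):
    """Group hyponyms under their genus word (hyperonym), sorted alphabetically."""
    taxonomy = defaultdict(list)
    for hyponym, genus in pairs:
        taxonomy[genus].append(hyponym)
    return {genus: sorted(words) for genus, words in taxonomy.items()}

def print_subtree(taxonomy, genus):
    """Print a genus word followed by its numbered direct hyponyms."""
    print(genus)
    for index, word in enumerate(taxonomy.get(genus, []), start=1):
        print(f"-- [{index}] {word}")

# Hypothetical extraction output (assumption).
pairs = [
    ("astma", "sjukdom"),
    ("akutfall", "sjukdom"),
    ("almsjuka", "sjukdom"),
    ("dollar", "myntenhet"),
]
taxonomy = build_taxonomy(pairs)
print_subtree(taxonomy, "sjukdom")
```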


Definition-driven extraction of ontological information (1)

Resources: SIMPLE-SE + SEMNET + GLDB

Methodological assumptions

The hyperonymic taxonomy, combined with the ontological information in SIMPLE-SE, supports semi-automatic extraction of ontological information

Procedure:

Preparatory phase, relevant for all ontological processing: annotate the GLDB data with the ontological information from SIMPLE-SE to generate an ontologically enriched SEMNET


Definition-driven extraction of ontological information (2)

Methodological assumptions (cont.)

The extracted ontological information is an approximation of the ontological category until verified by other methods, e.g. a corpus-driven methodology, semantic/ontological data from GLDB, or pattern matching based on selection restrictions

Since the annotated words in SIMPLE cover both hyperonyms and hyponyms, two methods are proposed here, each focusing on one of these semantic categories


Definition-driven extraction of ontological information (3)

Method I: from annotated hyponyms to new annotations of hyperonyms

Assumption: the ontological category of a hyperonym can be approximated given some information about its hyponyms and the structural knowledge inherent in the ontology

A hyperonym can be annotated if all of its annotated hyponyms share the same ontological tag, or if their tags share a common superordinate tag. The exception is the tag Entity, which is ontologically heterogeneous and thus relatively uninformative


Definition-driven extraction of ontological information (4)

Method I example

Hyponyms (known info)
diabetes [Disease]      cat [Air_animal]
asthma [Disease]        dog [Air_animal]
cholera [Disease]       fish [Water_animal]

Hyperonym (new info)
disease => [Disease]    animal => [Animal]
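
The rule behind this example can be sketched as a search for the lowest tag shared by all hyponym tags. The `PARENT` map below is a hypothetical fragment of a type hierarchy written for illustration, not the actual SIMPLE ontology, and the function names are assumptions.

```python
# Hypothetical fragment of an ontology: each tag points to its superordinate tag.
PARENT = {
    "Air_animal": "Animal",
    "Water_animal": "Animal",
    "Animal": "Entity",
    "Disease": "Entity",
    "Entity": None,
}

def ancestors(tag):
    """Return the tag followed by its superordinates, most specific first."""
    chain = []
    while tag is not None:
        chain.append(tag)
        tag = PARENT[tag]
    return chain

def infer_hyperonym_tag(hyponym_tags):
    """Lowest tag shared by all hyponym tags; None if only Entity is shared."""
    if not hyponym_tags:
        return None
    first, *rest = hyponym_tags
    for candidate in ancestors(first):
        if all(candidate in ancestors(tag) for tag in rest):
            return None if candidate == "Entity" else candidate
    return None

print(infer_hyperonym_tag(["Disease", "Disease", "Disease"]))            # Disease
print(infer_hyperonym_tag(["Air_animal", "Air_animal", "Water_animal"]))  # Animal
print(infer_hyperonym_tag(["Disease", "Air_animal"]))                     # None
```

Returning None when only Entity is shared leaves the hyperonym unannotated rather than assigning it the uninformative top tag.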


Definition-driven extraction of ontological information (5)

Method II: from annotated hyperonyms to new annotations of hyponyms

Assumption (resulting in an approximation)

Direct hyponyms (hyponyms directly subordinated to the genus word/hyperonym) automatically inherit the ontological category of their hyperonym. Manual annotation of the most frequent genus words/hyperonyms can therefore be recommended and justified.

Hyperonym (known info)           Hyponyms (new info)
myntenhet (monetary unit) [Money]  =>  dollar, krona, pund, rubel ... [Money]
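
One inheritance step of this kind can be sketched as follows, assuming the taxonomy is a dictionary from hyperonym to its direct hyponyms. The function name and data layout are assumptions; only the myntenhet example comes from the slide.

```python
def inherit_from_hyperonyms(annotations, direct_hyponyms):
    """Give each unannotated direct hyponym the tag of its annotated hyperonym."""
    new_annotations = {}
    for hyperonym, hyponyms in direct_hyponyms.items():
        tag = annotations.get(hyperonym)
        if tag is None:
            continue  # hyperonym not annotated, nothing to inherit
        for word in hyponyms:
            if word not in annotations:
                new_annotations[word] = tag
    return new_annotations

annotations = {"myntenhet": "Money"}
direct_hyponyms = {"myntenhet": ["dollar", "krona", "pund", "rubel"]}
print(inherit_from_hyperonyms(annotations, direct_hyponyms))
```

Existing annotations are never overwritten, so manually assigned tags take precedence over inherited ones.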


Definition-driven extraction of ontological information (6)

This assumption has far-reaching consequences for all annotated hyponyms that also occur as genus words, since their subordinates can automatically inherit the ontological class from the hyperonym/genus word.

Cascade effect:

sjukdom (disease) 66 hyponyms

+ infektionssjukdom 25 hyponyms

+ könssjukdom 4 hyponyms


Definition-driven extraction of ontological information (7)

Cascade distribution of the ontological type [Animal]

djur 102 hyponyms

+ hovdjur 10

+ ryggradsdjur 8

+ fågel 98

+ däggdjur 18

Note: the 80 most frequent genus words, once ontologically annotated, give rise to 11 000 automatically annotated genus words at the first hyponymy level. This number increases further through the cascade effect.
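
The cascade effect amounts to propagating a tag down the taxonomy level by level. A minimal sketch using a breadth-first traversal; the toy taxonomy below is hypothetical (only the genus words djur, hovdjur and fågel come from the slide) and the function name is an assumption.

```python
from collections import deque

def cascade(root, tag, direct_hyponyms):
    """Propagate a tag from a genus word to all of its direct and indirect hyponyms."""
    annotated = {root: tag}
    queue = deque([root])
    while queue:
        word = queue.popleft()
        for hyponym in direct_hyponyms.get(word, []):
            if hyponym not in annotated:  # also guards against cycles
                annotated[hyponym] = tag
                queue.append(hyponym)
    return annotated

# Toy taxonomy (assumption): hyponym lists are illustrative, not GLDB data.
direct_hyponyms = {
    "djur": ["hovdjur", "fågel"],
    "hovdjur": ["häst"],
    "fågel": ["örn", "sparv"],
}
result = cascade("djur", "Animal", direct_hyponyms)
print(len(result))  # 6
```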


Definition-driven extraction of ontological information (8)

Freq.  Genus word  Sense      Gloss     Ontological type
2702   person      1/1        person    HUMAN
461    anordning   1/*1       device    ARTIFACT
314    område      1/1        area      AREA (>LOCATION)
261    kvinna      1/1        woman     HUMAN
238    tillstånd   1/*        state     STATE
219    lära        1/*1       doctrine  DOMAIN
217    titel       1/*1       title     SOCIAL_STATUS (>HUMAN)
183    föremål     1/*        thing     CONCRETE_ENTITY
169    sätt        1/*1       manner    CONSTITUTIVE
167    plats       1/*1el 4   place     LOCATION
166    system      1/*1       system    CONSTITUTIVE
165    växt        1/*1       plant     PLANT


Conclusion

• Ontological annotations are approximations. They need to be verified against manually annotated data and/or by means of a corpus-driven methodology for extracting ontological information

• The status of ontological annotations needs to be explicitly specified in the database

• Method I (from hyponyms to hyperonyms) seems to complement Method II (from hyperonyms to hyponyms), since the range of annotated categories increases rapidly

• The quality (and quantity) of the lexical resources used determines the precision of the acquired results – ontology


Conclusion cont’d

To prevent overgeneration of incorrect ontological annotations, special attention needs to be paid to:

• disambiguation of polysemous and homographic genus words (hyperonyms)

krona [Artifact], [Money], [Part]

• analysis of compound nouns

gosedjur (cuddly toy) [Artifact] vs husdjur (pet) [Animal]
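
One simple safeguard against the first problem is to let a genus word pass on its tag only when it carries exactly one ontological tag. A minimal sketch; the function name and the data layout are assumptions, and the tags for krona are those listed above.

```python
def propagation_tag(genus_word, sense_tags):
    """Return the single tag of an unambiguous genus word, otherwise None."""
    tags = sense_tags.get(genus_word, [])
    if len(tags) == 1:
        return tags[0]
    return None  # ambiguous or unknown: leave for manual disambiguation

# Tags per genus word; a polysemous word lists one tag per sense.
sense_tags = {
    "krona": ["Artifact", "Money", "Part"],
    "myntenhet": ["Money"],
}
print(propagation_tag("myntenhet", sense_tags))  # Money
print(propagation_tag("krona", sense_tags))      # None
```

This check does not address compounds such as gosedjur, where the head noun djur would wrongly suggest [Animal]; those cases still call for separate analysis.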


References

Cederholm, Y. 1999. Automatisk konstruktion av en hyperonymitaxonomi baserad på definitioner i GLDB. In Från dataskärm och forskarpärm. MISS 25. Göteborgs universitet.

Hearst, M. 2003. Text Data Mining. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.

Mitkov, R. (ed.) 2003. The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.

Vossen, P. 2003. Ontologies. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.

For more about SIMPLE, see http://spraakbanken.gu.se