Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?

Nancy Ide and Jean Veronis

Proc KB&KB’93 Workshop, 1993, pp257-266

http://www.cs.vassar.edu/faculty/ide/pubs.html

As (mis-)interpreted by Peter Clark

The Postulates of MRD Work

• P1: MRDs contain information that is useful for NLP

• P2: This info is relatively easy to extract from MRDs

e.g., extraction of hypernyms (generalizations):

Dipper isa Ladle isa Spoon isa Utensil
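A minimal sketch of what hypernym extraction of this kind can look like, assuming the genus term is simply the first word after the leading article of a definition. The definitions in `TOY_DEFINITIONS` are invented for illustration and are not taken from any real dictionary; the names `genus_term` and `hypernym_chain` are likewise hypothetical.

```python
import re

# Hypothetical toy definitions, invented for this example only.
TOY_DEFINITIONS = {
    "dipper": "a ladle used for dipping liquid",
    "ladle": "a spoon with a long handle and a deep bowl",
    "spoon": "a utensil with a shallow bowl on a handle",
}

# Naive assumption: the genus term is the first word after the article.
GENUS_PATTERN = re.compile(r"^(?:a|an|the)\s+([a-z]+)", re.IGNORECASE)


def genus_term(definition):
    """Return the guessed genus term of a definition, or None."""
    match = GENUS_PATTERN.match(definition.strip())
    return match.group(1).lower() if match else None


def hypernym_chain(word, definitions):
    """Follow genus terms upward until no definition is left or a loop appears."""
    chain = [word]
    while word in definitions:
        parent = genus_term(definitions[word])
        if parent is None or parent in chain:
            break
        chain.append(parent)
        word = parent
    return chain


print(" isa ".join(hypernym_chain("dipper", TOY_DEFINITIONS)))
# prints: dipper isa ladle isa spoon isa utensil
```

The pattern only works because the toy definitions are so regular, which is exactly the kind of assumption the complaints that follow call into question.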

But…

• Not much to show for it so far (1993)

– handful of limited and imperfect taxonomies

– few studies on the quality of knowledge in MRDs

– few studies on extracting more complex info

Complaints…

• P1: useful info in MRDs:

– C1a: 50%-70% of info in dictionaries is “garbled”

– C1b: sense definitions ≠ concept usage (“real concepts”)

– C1c: some types of knowledge simply not there

• P2: Info can be easily extracted

• Most successes have been for hypernyms only

– C2a: MRD formats are a nightmare to deal with

– C2b: A virtually open-ended set of ways of describing facts

– C2c: Bootstrapping: Need a KB to build a KB from a MRD

C1a: MRD information is “garbled”

• Multiple people, multiple years effort

• Space restrictions, syntactic restrictions

• Particular problem 1:

– Attachment of terms too high (21%-34%)

• e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers”

• occurs fairly randomly

• Particular problem 2:

– Categories less clear at top levels

• “fork” and “spoon” are ok, but “implement” and “utensil” = ?

• Leads to disjuncts, e.g. “implement or utensil”

– Sometimes no word there to refer to a concept

• leads to circular definitions

• leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”)


• Particular problem 3:

– And hypernyms are relatively consistent!! Other semantic relations are given in a less consistent way, e.g., smell, taste, etc.
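Circular definitions of the kind mentioned under problem 2 can be found mechanically once hypernym links have been extracted. The sketch below assumes each word has at most one extracted hypernym; the links in `TOY_HYPERNYMS` are invented for illustration, and `find_cycle` is a hypothetical helper name.

```python
# Hypothetical extracted links (word -> hypernym), invented for this example.
TOY_HYPERNYMS = {
    "fork": "implement",
    "implement": "instrument",
    "instrument": "tool",
    "tool": "implement",
    "spoon": "utensil",
}


def find_cycle(start, hypernym_of):
    """Walk upward from start; return the words forming a loop, or None."""
    path = []
    position = {}
    word = start
    while word is not None:
        if word in position:
            # We have come back to a word already on the path: a circular definition.
            return path[position[word]:]
        position[word] = len(path)
        path.append(word)
        word = hypernym_of.get(word)
    return None


print(find_cycle("fork", TOY_HYPERNYMS))   # ['implement', 'instrument', 'tool']
print(find_cycle("spoon", TOY_HYPERNYMS))  # None
```

A loop such as this one is a hint that the dictionary has no single word for the shared concept, which is where a covert category like INSTRUMENTAL-OBJECT would have to be introduced by hand.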


C1b: sense definitions ≠ concept usage (“real concepts”)

• Ambiguity of word senses, e.g.,

– 87% of words in a sample fit > 1 word sense

• Word senses don’t reflect actual use

• Word sense distinctions differ between MRDs

– level of detail

– way lines are drawn between senses

– no definitive set of distinctions

C1c: some types of knowledge simply not there

• no broad contextual or world knowledge, e.g.,

– no connection between “lawn” and “house”, or between “ash” and “tobacco”

– “restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet]

• No mention that it’s a commercial business, e.g., for “the waitress collected the check.”

C2a: MRD formats are a nightmare to deal with

• Ambiguities / inconsistencies in typesetter format

• Complex grammars for entries

• Conventions are inconsistent, e.g. bracketing for

– “Canopic jar, urn, or vase” vs.

– “Junggar Pendi, Dzungaria, or Zungaria”

• Need a lot of hand pre-processing

– not much general value to this

– is a vast task in itself

– not many processed dictionaries available
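The bracketing problem can be made concrete with a small sketch. It assumes a headword list of the form "A B, C, or D" and produces both possible readings: one where the first word distributes over the later items, and one where the items are plain alternatives. The function name `headword_readings` is hypothetical, and nothing in the code can tell which reading is right.

```python
def headword_readings(entry):
    """Return both readings of an 'A B, C, or D' headword list."""
    parts = [p.strip() for p in entry.replace(", or ", ", ").split(",")]
    first_words = parts[0].split(" ", 1)
    if len(first_words) == 2:
        # Reading 1: the first word is a modifier shared by every item.
        modifier = first_words[0]
        distributed = [parts[0]] + [f"{modifier} {p}" for p in parts[1:]]
    else:
        distributed = list(parts)
    # Reading 2: the items are independent alternative headwords.
    return {"alternatives": parts, "distributed": distributed}


print(headword_readings("Canopic jar, urn, or vase")["distributed"])
# ['Canopic jar', 'Canopic urn', 'Canopic vase']
print(headword_readings("Junggar Pendi, Dzungaria, or Zungaria")["alternatives"])
# ['Junggar Pendi', 'Dzungaria', 'Zungaria']
```

The two entries have the same surface form but need different readings, so a purely typographic parser has to guess or be corrected by hand.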

C2b: A virtually open-ended set of ways of describing facts

But…

There is “virtually an open-ended set of phrases…”

C2c: Bootstrapping: Need a KB to build a KB

• Need knowledge to do NLP on MRDs!

– e.g. “carry by means of a handle” vs. “carry by means of a wagon”

• But an undisambiguated hierarchy is unusable, e.g.,

– “saucepan” isa “pan” isa “leaf”

• So you need to build your KB before you even start on the MRD
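The “saucepan” example can be reproduced with a toy sketch. It assumes two views of the same data: a word-level table listing the genus terms of all senses of a word, and a sense-level table in which every link already names the intended sense. Both tables and the sense labels such as `pan#2` are invented for illustration.

```python
# Hypothetical word-level table: word -> genus terms of all its senses.
WORD_LEVEL = {"saucepan": ["pan"], "pan": ["leaf", "container"]}

# Hypothetical sense-level table: sense -> the specific sense of its genus term.
SENSE_LEVEL = {
    "saucepan#1": "pan#2",
    "pan#1": "leaf#1",       # "pan" as a betel leaf
    "pan#2": "container#1",  # "pan" as a cooking vessel
}


def first_listed_chain(word, genus_terms):
    """Follow the first listed genus term of each word, with no disambiguation."""
    chain = [word]
    while word in genus_terms and genus_terms[word][0] not in chain:
        word = genus_terms[word][0]
        chain.append(word)
    return chain


def sense_chain(sense, genus_sense):
    """Follow links that already point at a specific sense."""
    chain = [sense]
    while sense in genus_sense and genus_sense[sense] not in chain:
        sense = genus_sense[sense]
        chain.append(sense)
    return chain


print(first_listed_chain("saucepan", WORD_LEVEL))  # ['saucepan', 'pan', 'leaf']
print(sense_chain("saucepan#1", SENSE_LEVEL))      # ['saucepan#1', 'pan#2', 'container#1']
```

The second chain is only correct because someone already decided that the “pan” in the saucepan definition is `pan#2`, which is the knowledge the extraction was supposed to deliver.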

Synthesis

• Underlying postulate of P1 and P2:

– P0: Large KBs cannot be built by hand

• Counterexamples:

– Cyc

– Dictionaries themselves!

• And besides…

– KBs are too hard to extract from MRDs

– MRDs don’t contain all the knowledge needed

• But: MRD contributions:

– understanding the structure of dictionaries

– convergence of NLP, lexicography, and electronic publishing interests

Ways forward…

• Combining Knowledge Sources:

– One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5%

• Also should combine knowledge from corpora as a means of “filling out” KBs

• Prediction:

– KBs built by people, using corpora and text extraction technology tools, and combined together by hand (Schubert-style; Code4; Ikarus)
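The effect of combining dictionaries can be illustrated with a toy calculation. The sketch assumes each dictionary is reduced to the set of words for which it supplies the information of interest, and counts a word as problematic when no dictionary in the pool covers it. The word sets below are invented and the resulting rates are not the figures quoted above.

```python
# Hypothetical sample of words we want information about.
SAMPLE = ["cup", "bowl", "fork", "spoon", "pan",
          "bottle", "ladle", "dipper", "glass", "vase"]

# Hypothetical coverage of five dictionaries, invented for this example.
TOY_DICTIONARIES = [
    {"cup", "bowl", "fork", "spoon"},
    {"pan", "bottle", "cup"},
    {"ladle", "dipper"},
    {"glass", "fork"},
    {"spoon", "bowl"},
]


def problematic_rate(sample, dictionaries):
    """Fraction of sample words that no dictionary in the pool covers."""
    covered = set().union(*dictionaries)
    missing = [word for word in sample if word not in covered]
    return len(missing) / len(sample)


print(problematic_rate(SAMPLE, TOY_DICTIONARIES[:1]))  # 0.6
print(problematic_rate(SAMPLE, TOY_DICTIONARIES))      # 0.1
```

The gain comes from the dictionaries having different gaps; five copies of the same coverage would not reduce the rate at all.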

Ways forward…

• MRDs will become encoded more consistently

• Better analysis needed of the types of knowledge needed for NLP

– perhaps don’t need the kind of precision in a KB

• Exploitation of associational information

– Very useful for sense disambiguation (e.g., Harabagiu)
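One simple way associational information can drive sense disambiguation is word overlap between the context and each sense's gloss. The sketch below is a simplified Lesk-style heuristic, swapped in here for illustration; it is not claimed to be the method of the work cited above. The glosses, the stopword list and the name `pick_sense` are all invented.

```python
import re

# Small hypothetical stopword list.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "with", "and", "to", "or"}

# Hypothetical glosses for two senses of "pan", invented for this example.
TOY_GLOSSES = {
    "pan_vessel": "a metal vessel used for cooking food",
    "pan_leaf": "a betel leaf chewed as a stimulant",
}


def content_words(text):
    """Lowercase words of the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def pick_sense(context, glosses):
    """Choose the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    return max(glosses, key=lambda s: len(context_words & content_words(glosses[s])))


print(pick_sense("she heated the soup in a pan for cooking", TOY_GLOSSES))
# pan_vessel
```

No precise KB is involved: the choice rests only on which words tend to appear together, and ties simply fall to the first sense listed.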

Ways forward…

• Lexicographers increasingly interested in using lexical databases for their work

• Could create a NLP-like KB directly

– Create explicit semantic links between word entries

– Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)

– Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated)

– Ensure consistency of sense division

• e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal); this inconsistency could be spotted
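A consistency check of this kind is easy to sketch once entries live in a lexical database. The example assumes each word in a group of related words is stored with the set of sense types it has been given; the table `TOY_SENSE_TYPES` mirrors the “cup” / “bowl” / “glass” example and the function name is hypothetical.

```python
# Hypothetical entries: word -> sense types recorded for it.
TOY_SENSE_TYPES = {
    "cup": {"literal", "metonymic"},
    "bowl": {"literal", "metonymic"},
    "glass": {"literal"},
}


def missing_sense_types(entries):
    """Report, for each word, the sense types its sibling words have but it lacks."""
    expected = set().union(*entries.values())
    return {
        word: sorted(expected - types)
        for word, types in entries.items()
        if expected - types
    }


print(missing_sense_types(TOY_SENSE_TYPES))  # {'glass': ['metonymic']}
```

The check only flags candidates: a lexicographer still has to decide whether “glass” really lacks the metonymic sense or whether the entry is incomplete.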
