Survey on WSD and IR
Apex@SJTU
WSD: Introduction
Problems in an online news retrieval system:
query: “major”
Articles retrieved:
about “Prime Minister John Major MP”
“major” appears as an adjective
“major” appears as a military rank
WSD: Introduction
Gale, Church and Yarowsky (1992) cite work dating back to 1950.
For many years, WSD was applied only to limited domains and a small vocabulary.
In recent years, disambiguators have been applied to resolve the senses of words in a large heterogeneous corpus.
With a more accurate representation, and a query also marked up with word senses, researchers believed that the accuracy of retrieval would have to improve.
Approaches to disambiguation
Disambiguation based on manually generated rules
Disambiguation using evidence from existing corpora.
Disambiguation based on manually generated rules
Weiss (1973):
general context rule: If the word “type” appears near “print”, it most likely means a small block of metal bearing a raised character on one end.
template rule: If “of” appears immediately after “type”, it most likely means a subdivision of a particular kind of thing.
Weiss (1973):
Template rules were better, so they were applied first.
To create rules:
Examine 20 occurrences of an ambiguous word.
Test these manually created rules on a further 30 occurrences.
Accuracy: 90%
Cause of errors: idiomatic uses.
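The two kinds of rule for “type”, with the template rule tried first, can be sketched roughly as follows. This is only an illustration, not Weiss's system: the function name, the sense labels and the window size of 5 tokens used for "near" are all assumptions.

```python
def disambiguate_type(tokens, index, window=5):
    """Guess the sense of the token "type" at position `index`.

    Sense labels ("kind", "printing") and `window` are hypothetical.
    """
    # Template rule, tried first: "of" immediately after "type".
    if index + 1 < len(tokens) and tokens[index + 1] == "of":
        return "kind"
    # General context rule: "print" within `window` tokens on either side.
    start = max(0, index - window)
    nearby = tokens[start:index] + tokens[index + 1:index + 1 + window]
    if "print" in nearby:
        return "printing"
    return None  # no rule fired


s1 = "this type of print is hard to read".split()
s2 = "the printer set the type before the print run".split()
s3 = "I like this type".split()
print(disambiguate_type(s1, s1.index("type")))  # template rule wins
print(disambiguate_type(s2, s2.index("type")))  # context rule
print(disambiguate_type(s3, s3.index("type")))  # no rule applies
```

In the first sentence both rules could fire; trying the template rule first is what decides the outcome.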
Disambiguation based on manually generated rules
Kelly and Stone (1975):
created a set of rules for 6,000 words
the rules consisted of contextual rules similar to those of Weiss
in addition, the grammatical category of a word was used as a strong indicator of sense: “the train” and “to train”
Kelly and Stone (1975):
The grammar and context rules were grouped into sets so that only certain rules were applied in certain situations.
Conditional statements controlled the application of rule sets.
Unlike Weiss’s system, this disambiguator was designed to process a whole sentence at the same time.
Accuracy: not a success
Disambiguation based on manually generated rules
Small and Rieger (1982) came to similar conclusions.
When this type of disambiguator was extended to work on a larger vocabulary, the effort involved in building it became too great.
Since the 1980s, WSD research has concentrated on automatically generated rules based on sense evidence derived from a machine-readable corpus.
Disambiguation using evidence from existing corpora
Lesk (1988): Resolve the sense of “ash” in:
There was ash from the coal fire.
Dictionary definitions looked up:
ash(1): The soft grey powder that remains after something has been burnt.
ash(2): A forest tree common in Britain.
Definitions of context words looked up:
coal(1): A black mineral which is dug from the earth, which can be burnt to give heat.
fire(1): The condition of burning; flames, light and great heat.
fire(2): The act of firing weapons or artillery at an enemy.
Lesk (1988):
Sense definitions are ranked by a scoring function based on the number of words that co-occur.
Questionable: how often the word overlap necessary for disambiguation occurs.
Accuracy: “very brief experimentation”, 50%--70%
No analysis of the failures, although definition length is recognized as a possible factor in deciding which dictionary to use.
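The overlap scoring can be tried on the “ash” example with a short sketch. This is a simplified illustration rather than Lesk's actual scoring function: the stopword list, the tokenisation and the choice to sum raw overlap counts are assumptions, and no stemming is done (so “burning” does not match “burnt”).

```python
# Hypothetical stopword list, chosen only to cover the example definitions.
STOPWORDS = {"a", "an", "the", "of", "which", "is", "from", "that", "to",
             "be", "can", "and", "in", "at", "or", "has", "been", "after",
             "something"}


def content_words(definition):
    """Lowercase, strip punctuation and drop stopwords."""
    words = (w.strip(".,;:()").lower() for w in definition.split())
    return {w for w in words if w and w not in STOPWORDS}


def score_senses(target_senses, context_definitions):
    """Score each sense by the words it shares with the context definitions."""
    scores = {}
    for sense, definition in target_senses.items():
        sense_words = content_words(definition)
        scores[sense] = sum(len(sense_words & content_words(d))
                            for d in context_definitions)
    return scores


ASH = {
    "ash(1)": "The soft grey powder that remains after something has been burnt.",
    "ash(2)": "A forest tree common in Britain.",
}
CONTEXT = [
    "A black mineral which is dug from the earth, which can be burnt to give heat.",
    "The condition of burning; flames, light and great heat.",
    "The act of firing weapons or artillery at an enemy.",
]

scores = score_senses(ASH, CONTEXT)
best = max(scores, key=scores.get)
print(scores, best)
```

Here the only shared content word is “burnt” (ash(1) and coal(1)), which shows how thin the evidence can be when definitions are short.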
Disambiguation using evidence from existing corpora
Wilks et al. (1990):
addressed this word overlap problem by expanding a dictionary definition with words that commonly co-occurred with the text of that definition.
Co-occurrence information was derived from all definition texts in the dictionary.
Wilks et al. (1990):
Longman Dictionary of Contemporary English (LDOCE): all its definitions were written using a simplified vocabulary of around 2,000 words.
Few synonyms, which are a distracting element in the co-occurrence calculation.
“bank”:
for economic sense: “money”, “check”, “rob”
for geographical sense: “river”, “flood”, “bridge”
Accuracy: “bank” in 200 sentences, judged correct if the chosen sense coincided with the one chosen manually: 53% at the fine-grained level (13 senses) and 85% at the coarse-grained level (5 senses).
They suggested using simulated annealing to disambiguate a whole sentence simultaneously.
Disambiguating simultaneously
Cowie et al. (1992):
Accuracy: tested on 67 sentences, 47% for fine-grained senses and 72% for coarse-grained ones.
No comparison with Wilks et al.’s results.
No baseline.
A possible baseline: senses chosen at random
A better one: select the most common sense
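Both baselines are cheap to compute once a sense-tagged sample exists. The sketch below uses a tiny made-up sample (the words, sense labels and counts are hypothetical, not data from any of the cited studies) to show the most-common-sense baseline next to the expected accuracy of random choice.

```python
from collections import Counter, defaultdict


def most_common_sense_model(tagged):
    """Map each word to the sense it carries most often in `tagged`."""
    counts = defaultdict(Counter)
    for word, sense in tagged:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def accuracy(model, tagged):
    """Fraction of occurrences whose sense the model predicts correctly."""
    correct = sum(1 for word, sense in tagged if model.get(word) == sense)
    return correct / len(tagged)


def expected_random_accuracy(tagged, senses_per_word):
    """Expected accuracy when a sense is picked uniformly at random."""
    return sum(1 / senses_per_word[word] for word, _ in tagged) / len(tagged)


# Hypothetical sense-tagged occurrences: (word, sense) pairs.
train = [("bank", "money")] * 3 + [("bank", "river")] \
      + [("ash", "powder")] * 2 + [("ash", "tree")]
test = [("bank", "money"), ("bank", "river"),
        ("ash", "powder"), ("ash", "powder")]

model = most_common_sense_model(train)
print(accuracy(model, test))
print(expected_random_accuracy(test, {"bank": 2, "ash": 2}))
```

Because sense distributions are usually skewed, the most-common-sense baseline tends to sit well above random choice, which is why it is the more demanding comparison.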
Manually tagging a corpus
A technique from POS tagging:
manually mark up a large text corpus with POS tags, and then train a statistical classifier to associate features with occurrences of the tags.
Ng and Lee (1996):
disambiguated 192,000 occurrences of 191 words.
examined the following features:
POS and morphological form of the sense-tagged word
unordered set of its surrounding words
local collocations relative to it
if the sense-tagged word was a noun, the presence of a verb was also noted.
Ng and Lee (1996):
Experiments:
separated their corpus into training and test sets on an 89%--11% split
accuracy: 63.7% (baseline: 58.1%)
sense definitions used were from WordNet: 7.8 senses per word for nouns and 12.0 senses per word for verbs
no comparison possible between WordNet definitions and LDOCE
Using thesauri: Yarowsky (1992)
Roget’s thesaurus: 1,042 semantic categories
Grolier Multimedia Encyclopedia
To decide which semantic category an ambiguous word occurrence should be assigned to:
a set of clue words, one set for each category, was derived from a POS-tagged corpus
the context of each occurrence was gathered
a term selection process similar to relevance feedback was used to derive clue words
Yarowsky (1992)
E.g. clue words for animal/insects:
species, family, bird, fish, cm, animal, tail, egg, wild, common, coat, female, inhabit, eat, nest
Comparison between words in the context and the clue word sets
Accuracy: 12 ambiguous words, several hundred occurrences, 92% accuracy on average
Comparisons were suspect.
Testing disambiguators
Few “pre-disambiguated” test corpora publicly available.
A sense-tagged version of the Brown corpus, called SEMCOR, is available.
A TREC-like effort, called SENSEVAL, is underway.
WSD and IR experiments
Voorhees (1993): based on WordNet:
Each of 90,000 words and phrases is assigned to one or more synsets.
A synset is a set of words that are synonyms of each other; the words of a synset define it and its meaning.
All synsets are linked together to form a mostly hierarchical semantic network based on hypernymy and hyponymy.
Other relations: meronymy, holonymy, antonymy.
Voorhees (1993):
the hood of a word sense contained in synset s is the largest connected subgraph that:
contains s;
contains only descendants of an ancestor of s;
contains no synset that has a descendant that includes another instance of a member of s.
Results: consistently worse, because senses were tagged inaccurately
The hood of the first sense of “house” would include the words: housing, lodging, apartment, flat, cabin, gatehouse, bungalow, cottage.
Wallis (1993)
replace words with definitions from LDOCE.
“ocean” and “sea”:
ocean: The great mass of salt water that covers most of the earth;
sea: the great body of salty water that covers much of the earth’s surface.
disappointing results.
no analysis of the cause.
Sussna (1993)
Assign a weight to all relations and calculate the semantic distance between two synsets.
Calculate the semantic distance between context words and each of the synsets to rank the synsets.
Parameters: size of context (41 as optimal), the number of words disambiguated simultaneously (only 10 because of computational considerations).
Accuracy: 56%
Analyses of WSD & IR
Krovetz & Croft:
sense mismatches were significantly more likely to occur in non-relevant documents.
word collocation
skewed frequency distribution
Situations under which WSD may prove useful:
where collocation is less prevalent
where query words are used in a minority sense
Analyses of WSD & IR
Sanderson (1994, 1997):
pseudo-words: banana/kalashnikov/anecdote
experiments on the factor of query length:
the effectiveness of retrieval based on short queries was greatly affected by the introduction of ambiguity, but much less so for longer queries.
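A pseudo-word introduces artificial ambiguity by replacing every occurrence of several unrelated words with one combined token. A minimal sketch follows; the function names and the example sentence are assumptions made for illustration.

```python
def make_pseudo_word(words):
    """Join the member words into one ambiguous token."""
    return "/".join(words)


def introduce_ambiguity(tokens, group):
    """Replace every occurrence of a group member with the pseudo-word."""
    pseudo = make_pseudo_word(group)
    members = set(group)
    return [pseudo if token in members else token for token in tokens]


group = ["banana", "kalashnikov", "anecdote"]
doc = "he told an anecdote about a banana".split()
ambiguous = introduce_ambiguity(doc, group)
print(ambiguous)
```

Since the original word at each position is known, the amount of ambiguity (and of disambiguation error) can be controlled exactly, which is what makes the technique convenient for retrieval experiments.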
Analyses of WSD & IR
Gonzalo et al. (1998):
experiments based on SEMCOR: write a summary for each document and use it as a query, so that each query has exactly one relevant document.
Cause of error: a sense may be too specific
e.g. newspaper as a business concern as opposed to the physical object
Gonzalo et al. (1998):
synset-based representation: retrieval based on synsets seems to be the best
erroneous disambiguation and its impact on retrieval effectiveness:
baseline precision: 52.6%
when error 30%, precision 54.4%
when error 60%, precision 49.1%
Sanderson (1997):
output word senses in a list ranked by a confidence score
accuracy: worse than retrieval without senses, better than retrieval with each word tagged with a single sense.
possible cause: disambiguation errors.
Disambiguation without sense definition
Zernik (1991):
generate clusters for an ambiguous word by three criteria: context words, grammatical category and derivational morphology.
associate each cluster with a dictionary sense.
e.g. “train”: 95% accuracy (grammatical category)
“office”: full of errors
Disambiguation without sense definition
Schütze and Pedersen (1995):
One of the very few results that show an improvement (14%)
Clusters based on context words only: words with similar contexts are put into the same cluster, but a cluster is recognized only if the context appears more than fifty times in the corpus
Similar contexts of “ball”: tennis, football, cricket. Thus this method breaks up a word’s commonest sense into a number of uses (e.g. the sporting sense of ball).
Schütze and Pedersen (1995):
score each use of a word
represent a word occurrence by:
just the word
the word with its commonest use
the word with n of its uses
WSD in IR Revisited (SIGIR ’03)
Skewed frequency distributions coupled with the query term co-occurrence effect are the reasons why traditional IR techniques that don’t take sense into account are not penalized severely.
Inaccurate fine-grained WSD has an extremely negative effect on the performance of an IR system.
To achieve increases in performance, it is imperative to minimize the impact of the inaccurate disambiguation.
The need for 90% accurate disambiguation in order to see performance increases remains questionable.
The WSD methods applied
A number of experiments were tried, but nothing better than the following was found:
applying each knowledge source (collocations, co-occurrence, and sense frequency) in a stepwise fashion:
use a context window consisting of the sentence surrounding the target word to identify the sense of the word
examine the surrounding sentence to see whether it contains any collocates observed in SEMCOR
specific sense data
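One way to read "stepwise" is as a back-off chain: try collocations, then co-occurrence, then fall back on sense frequency. The sketch below illustrates that reading only; it is not the authors' implementation, and the function name, data structures, sense labels and counts are all hypothetical.

```python
def stepwise_sense(target, sentence, collocations, cooccurrence, sense_frequency):
    """Return (sense, knowledge source used) for `target` in `sentence`."""
    context = [w for w in sentence if w != target]
    # Step 1: a known collocate in the sentence decides the sense.
    for word in context:
        sense = collocations.get((target, word))
        if sense is not None:
            return sense, "collocation"
    # Step 2: pick the sense whose co-occurring words overlap the context most.
    overlaps = {sense: len(words & set(context))
                for sense, words in cooccurrence.get(target, {}).items()}
    if overlaps and max(overlaps.values()) > 0:
        return max(overlaps, key=overlaps.get), "co-occurrence"
    # Step 3: fall back on the most frequent sense.
    freqs = sense_frequency[target]
    return max(freqs, key=freqs.get), "sense frequency"


# Hypothetical knowledge sources for the word "bank".
COLLOCATIONS = {("bank", "river"): "bank/shore"}
COOCCURRENCE = {"bank": {"bank/finance": {"money", "loan"},
                         "bank/shore": {"water", "boat"}}}
SENSE_FREQUENCY = {"bank": {"bank/finance": 20, "bank/shore": 5}}

for text in ("the river bank was muddy",
             "the bank approved the loan",
             "the bank was closed"):
    print(stepwise_sense("bank", text.split(),
                         COLLOCATIONS, COOCCURRENCE, SENSE_FREQUENCY))
```

Ordering the sources from most to least specific means the frequency fallback is used only when the sentence offers no evidence at all.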
WSD in IR Revisited: Conclusions
Reasons for success:
high-precision WSD technique
sense frequency statistics
Resilience of the vector space model
Analysis of Schütze and Pedersen’s success: added tolerance
“A highly accurate bootstrapping algorithm for word sense disambiguation” Rada M. 2000
Disambiguate all nouns and verbs:
step 1: complex nominals
step 2: named entities
step 3: word pairs, based on SEMCOR: (previous word, word) pair, (word, successive word) pair
step 4: context, based on SEMCOR and WordNet; in WordNet, the hypernyms of a word are also its context
“A highly accurate bootstrapping algorithm for word sense disambiguation” (cont’d)
step 5: words with semantic distance 0 from a word that has already been disambiguated
step 6: words with semantic distance 1 from a word that has already been disambiguated
step 7: words with semantic distance 0 among ambiguous words
step 8: words with semantic distance 1 among ambiguous words
“An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases” (SIGIR ’04)
Significant increase for short queries
WSD only on the query, plus query expansion
Phrase-based and term-based
Pseudo-relevance feedback
Phrases identification
4 types of phrases: proper names (named entities), dictionary phrases (from WordNet), simple phrases, complex phrases
Decide the window size of simple/complex phrases by calculating correlation
Correlation
WSD
Unlike Rada Mihalcea’s WSD, Liu did not use SEMCOR, only WordNet
6 steps; basic ideas: use hypernyms, hyponyms, cross-references, etc.
Query Expansion
Add synonyms (conditional)
Add definition words (only the first shortest noun phrase), on condition that it is highly globally correlated
Add hyponyms (conditional)
Add compound words (conditional)
PSEUDO RELEVANCE FEEDBACK
Using Global Correlations and Wordnet
Global_cor > 1 and one of the following conditions:
1. it has a single sense (monosense)
2. its definition contains some other query terms
3. it is in the top 10 ranked documents
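Read as a predicate, the rule says that a candidate term is added only when its global correlation exceeds 1 and at least one of the listed conditions holds. A minimal sketch under that reading, where the function and parameter names are hypothetical and how Global_cor itself is computed is left out:

```python
def should_add_term(global_cor, sense_count, definition_words,
                    other_query_terms, in_top10_documents):
    """Decide whether a candidate expansion term should be added."""
    if global_cor <= 1:
        return False
    monosense = sense_count == 1
    definition_overlap = bool(set(definition_words) & set(other_query_terms))
    return monosense or definition_overlap or in_top10_documents


# A polysemous term whose definition mentions another query term.
print(should_add_term(1.4, 3, ["large", "salt", "water"], ["water", "pollution"], False))
# A highly correlated term that satisfies none of the conditions.
print(should_add_term(2.0, 3, ["large", "salt"], ["pollution"], False))
```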
Combining Local and Global Correlations:
Results
SO: standard Okapi (term-similarity)
NO: enhanced SO
NO+P: +phrase-similarity
NO+P+D: +WSD
NO+P+D+F: +pseudo-feedback
Results:
Model conclusion
WSD on the query only
WSD using only WordNet, no SEMCOR
Complex query expansion
Pseudo-relevance feedback
Phrase-based and term-based
Thank you!