Survey on WSD and IR Apex@SJTU

Upload: valentine-mccormick

Post on 21-Jan-2016


Survey on WSD and IR

Apex@SJTU

WSD: Introduction

Problems in online news retrieval system:

query: “major”

Articles retrieved:
  about “Prime Minister John Major MP”
  “major” appears as an adjective
  “major” appears as a military rank

WSD: Introduction

Gale, Church and Yarowsky (1992) cite work dating back to 1950.

For many years, WSD was applied only to limited domains and a small vocabulary.

In recent years, disambiguators are applied to resolve the senses of words in a large heterogeneous corpus.

With a more accurate representation and a query also marked up with word sense, researchers believe that the accuracy of retrieval would have to improve.

Approaches to disambiguation

Disambiguation based on manually generated rules

Disambiguation using evidence from existing corpora.

Disambiguation based on manually generated rules

Weiss (1973):

general context rule: If the word “type” appears near “print”, it most likely means a small block of metal bearing a raised character on one end.

template rule: If “of” appears immediately after “type”, it most likely means a subdivision of a particular kind of thing.

Weiss (1973):

Template rules were better, so they were applied first.

To create rules:
  Examine 20 occurrences of an ambiguous word.
  Test these manually created rules on a further 30 occurrences.

Accuracy: 90%

Cause of errors: idiomatic uses.
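The two kinds of Weiss-style rules for “type” can be sketched as follows. This is only an illustration: the function name, the sense labels and the window size of 5 tokens are assumptions, not details from Weiss (1973).

```python
import re


def disambiguate_type(sentence, window=5):
    """Return a sense label for the first occurrence of 'type', or None.

    Sense labels and the window size are illustrative assumptions.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if "type" not in tokens:
        return None
    i = tokens.index("type")

    # Template rule (tried first): "of" immediately after "type".
    if i + 1 < len(tokens) and tokens[i + 1] == "of":
        return "kind"

    # General context rule: "print" within `window` tokens of "type".
    nearby = tokens[max(0, i - window): i + window + 1]
    if "print" in nearby:
        return "printing block"

    # No rule fired; a real system would need a default sense here.
    return None
```

Trying the template rule first mirrors the observation that template rules were the more reliable of the two.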

Disambiguation based on manually generated rules

Kelly and Stone (1975):
  created a set of rules for 6,000 words
  the rules consisted of contextual rules similar to those of Weiss
  in addition, used the grammatical category of a word as a strong indicator of sense: “the train” and “to train”

Kelly and Stone (1975):

The grammar and context rules were grouped into sets so that only certain rules were applied in certain situations.

Conditional statements controlled the application of rule sets.

Unlike Weiss’s system, this disambiguator was designed to process a whole sentence at the same time.

Accuracy: not a success

Disambiguation based on manually generated rules

Small and Rieger (1982) came to similar conclusions.

When this type of disambiguator was extended to work on a larger vocabulary, the effort involved in building it became too great.

Since the 1980s, WSD research has concentrated on automatically generated rules based on sense evidence derived from a machine-readable corpus.

Disambiguation using evidence from existing corpora

Lesk (1988):

Resolve the sense of “ash” in: There was ash from the coal fire.

Dictionary definitions looked up:
  ash(1): The soft grey powder that remains after something has been burnt.
  ash(2): A forest tree common in Britain.

Definitions of context words looked up:
  coal(1): A black mineral which is dug from the earth, which can be burnt to give heat.
  fire(1): The condition of burning; flames, light and great heat.
  fire(2): The act of firing weapons or artillery at an enemy.

Lesk (1988):

Sense definitions are ranked by a scoring function based on the number of words that co-occur.

Questionable: how often the word overlap necessary for disambiguation occurred.

Accuracy: “very brief experimentation”, 50%--70%

No analysis of the failures, although definition length is recognized as a possible factor in deciding which dictionary to use.
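The overlap scoring for the “ash” example can be sketched as below. It is a simplified illustration, not Lesk's implementation: the stopword list, the lack of stemming and the function names are all assumptions, and the definitions are the ones quoted on the slides.

```python
import re

# Small hand-picked stopword list (an assumption, chosen for this example).
STOPWORDS = {
    "a", "an", "the", "of", "that", "which", "is", "be", "can", "to", "in",
    "at", "or", "and", "from", "after", "has", "been", "something",
}

SENSES = {
    "ash": {
        1: "The soft grey powder that remains after something has been burnt.",
        2: "A forest tree common in Britain.",
    },
}

CONTEXT_DEFINITIONS = {
    "coal": ["A black mineral which is dug from the earth, "
             "which can be burnt to give heat."],
    "fire": ["The condition of burning; flames, light and great heat.",
             "The act of firing weapons or artillery at an enemy."],
}


def content_words(text):
    """Lowercased word set of a definition, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def lesk_scores(target, context_words):
    """Count, per sense of `target`, the words shared with context definitions."""
    context_bag = set()
    for word in context_words:
        for definition in CONTEXT_DEFINITIONS.get(word, []):
            context_bag |= content_words(definition)
    return {
        sense: len(content_words(definition) & context_bag)
        for sense, definition in SENSES[target].items()
    }


scores = lesk_scores("ash", ["coal", "fire"])
best_sense = max(scores, key=scores.get)
```

Here the only shared word is “burnt”, which links ash(1) to coal(1). Without stemming, “burning” in fire(1) does not match, which illustrates how fragile the required word overlap can be.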

Disambiguation using evidence from existing corpora

Wilks et al. (1990): addressed this word overlap problem by using a technique of expanding a dictionary definition with words that commonly co-occurred with the text of that definition.

Co-occurrence information was derived from all definition texts in the dictionary.

Wilks et al. (1990):

Longman’s Dictionary of Contemporary English (LDOCE): all its definitions were written using a simplified vocabulary of around 2,000 words.

Few synonyms, which are a distracting element in the co-occurrence calculation.

“bank”:
  for the economic sense: “money”, “check”, “rob”
  for the geographical sense: “river”, “flood”, “bridge”

Accuracy: “bank” in 200 sentences, judged correct if the chosen sense coincides with one manually chosen: 53% at the fine-grained level (13 senses) and 85% at the coarse-grained level (5 senses).

They suggested using simulated annealing to disambiguate a whole sentence simultaneously.

Disambiguating simultaneously

Cowie et al. (1992):

Accuracy: tested on 67 sentences, 47% for fine-grained senses and 72% for coarse-grained ones.

No comparison with Wilks et al.’s results. No baseline.

A possible baseline: senses chosen at random

A better one: select the most common sense
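The two baselines can be sketched in a few lines. The data layout (lists of `(word, sense)` pairs and a sense inventory dictionary) and the function names are assumptions made for this illustration.

```python
import random
from collections import Counter


def most_common_sense_baseline(train, test):
    """Accuracy when each test word gets its most frequent training sense."""
    counts = {}
    for word, sense in train:
        counts.setdefault(word, Counter())[sense] += 1
    correct = sum(
        1 for word, sense in test
        if word in counts and counts[word].most_common(1)[0][0] == sense
    )
    return correct / len(test)


def random_baseline(inventory, test, seed=0):
    """Accuracy when each test word gets a uniformly random sense."""
    rng = random.Random(seed)
    correct = sum(
        1 for word, sense in test if rng.choice(inventory[word]) == sense
    )
    return correct / len(test)


# Toy, hand-made data (not from any of the cited experiments).
train = [("bank", "money")] * 3 + [("bank", "river")]
test = [("bank", "money"), ("bank", "money"), ("bank", "river"), ("bank", "money")]
inventory = {"bank": ["money", "river"]}

mcs_accuracy = most_common_sense_baseline(train, test)
rnd_accuracy = random_baseline(inventory, test)
```

Because sense distributions are usually skewed, the most-common-sense baseline tends to be much harder to beat than random choice, which is why it is the more informative comparison.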

Manually tagging a corpus

A technique in POS tagging: manually mark up a large text corpus with POS tags, and then train a statistical classifier to associate features with occurrences of the tags.

Ng and Lee (1996): disambiguate 192,000 occurrences of 191 words.

Examine the following features:
  POS and morphological form of the sense-tagged word
  unordered set of its surrounding words
  local collocations relative to it
  if the sense-tagged word was a noun, the presence of a verb was noted also.
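A rough sketch of how such features might be extracted for one target word is given below. The feature names, the one-word collocation offsets and the tag set ("NOUN", "VERB", ...) are assumptions for illustration and are not taken from Ng and Lee (1996).

```python
def extract_features(tokens, pos_tags, i):
    """Build a feature dictionary for the target token at index i."""

    def token_at(j):
        # Pad positions that fall outside the sentence.
        return tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"

    return {
        # POS and surface (morphological) form of the target word.
        "form": tokens[i].lower(),
        "pos": pos_tags[i],
        # Unordered set of the surrounding words.
        "surrounding": frozenset(
            t.lower() for k, t in enumerate(tokens) if k != i
        ),
        # Local collocations relative to the target.
        "colloc_-1": token_at(i - 1),
        "colloc_+1": token_at(i + 1),
        "colloc_-1_+1": (token_at(i - 1), token_at(i + 1)),
        # For nouns only: is there a verb anywhere in the sentence?
        "has_verb": pos_tags[i] == "NOUN" and "VERB" in pos_tags,
    }


tokens = ["He", "paid", "interest", "on", "the", "loan"]
pos_tags = ["PRON", "VERB", "NOUN", "ADP", "DET", "NOUN"]
features = extract_features(tokens, pos_tags, 2)
```

A classifier trained on sense-tagged occurrences would then associate these feature values with senses.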

Ng and Lee (1996):

Experiments:
  separated their corpus into training and test sets on an 89%--11% split
  accuracy: 63.7% (baseline: 58.1%)
  sense definitions used were from WordNet: 7.8 senses per word for nouns and 12.0 senses per word for verbs
  no comparison possible between WordNet definitions and LDOCE

Using thesauri: Yarowsky (1992)

Roget’s thesaurus: 1,042 semantic categories

Grolier Multimedia Encyclopedia

To decide which semantic category an ambiguous word occurrence should be assigned to:
  a set of clue words, one set for each category, was derived from a POS tagged corpus
  the context of each occurrence was gathered
  a term selection process similar to relevance feedback was used to derive clue words

Yarowsky (1992)

E.g. clue words for animal/insects: species, family, bird, fish, cm, animal, tail, egg, wild, common, coat, female, inhabit, eat, nest

Comparison between words in the context and the clue word sets

Accuracy: 12 ambiguous words, several hundred occurrences, 92% accuracy on average

Comparisons were suspect.
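The comparison between context words and clue word sets can be sketched as a simple overlap count. This is a simplification: the unweighted count and the function name are assumptions, and the second category's clue words are invented for the example. Only the animal/insects list comes from the slide.

```python
CLUE_WORDS = {
    # Clue words quoted on the slide.
    "ANIMAL/INSECT": {
        "species", "family", "bird", "fish", "cm", "animal", "tail", "egg",
        "wild", "common", "coat", "female", "inhabit", "eat", "nest",
    },
    # Hypothetical clue words, made up for this illustration.
    "TOOLS/MACHINERY": {
        "tool", "machine", "engine", "lift", "steel", "load", "operator",
    },
}


def best_category(context_tokens):
    """Pick the category whose clue words overlap most with the context."""
    context = {t.lower() for t in context_tokens}
    scores = {cat: len(clues & context) for cat, clues in CLUE_WORDS.items()}
    return max(scores, key=scores.get), scores


category_a, scores_a = best_category(
    "the crane is a wading bird whose nest holds one egg".split()
)
category_b, scores_b = best_category(
    "the operator used the crane to lift a steel load".split()
)
```

A fuller version would weight each clue word by how strongly it indicates its category instead of counting every match equally.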

Testing disambiguators

Few “pre-disambiguated” test corpora publicly available.

A sense-tagged version of the Brown corpus, called SEMCOR, is available. A TREC-like evaluation effort, called SENSEVAL, is underway.

WSD and IR experiments

Voorhees (1993): based on WordNet:

Each of 90,000 words and phrases is assigned to one or more synsets.

A synset is a set of words that are synonyms of each other; the words of a synset define it and its meaning.

All synsets are linked together to form a mostly hierarchical semantic network based on hypernymy and hyponymy.

Other relations: meronymy, holonymy, antonymy.

Voorhees (1993):

the hood of a word sense contained in synset s:
  is the largest connected subgraph that
  contains s;
  contains only descendants of an ancestor of s;
  contains no synset that has a descendant that includes another instance of a member of s.

Retrieval was consistently worse; senses were tagged inaccurately.

The hood of the first sense of “house” would include the words: housing, lodging, apartment, flat, cabin, gatehouse, bungalow, cottage.

Wallis (1993)

replace words with definitions from LDOCE.

“ocean” and “sea”:
  ocean: The great mass of salt water that covers most of the earth;
  sea: the great body of salty water that covers much of the earth’s surface.

disappointing results.

no analysis of the cause.

Sussna (1993)

Assign a weight to all relations and calculate the semantic distance between two synsets.

Calculate the semantic distance between context words and each of the synsets to rank the synsets.

Parameters: size of context (41 as optimal), the number of words disambiguated simultaneously (only 10 because of computational cost).

Accuracy: 56%
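One way to read “semantic distance over weighted relations” is as a shortest-path problem on the synset network. The sketch below uses Dijkstra's algorithm on a tiny hand-made graph. The graph, its weights and the function names are hypothetical, and Sussna's actual weighting scheme is not reproduced here.

```python
import heapq


def build_graph(weighted_edges):
    """Undirected adjacency list from (synset_a, synset_b, weight) triples."""
    graph = {}
    for a, b, w in weighted_edges:
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    return graph


def semantic_distance(graph, source, target):
    """Weight of the cheapest path between two synsets (Dijkstra)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, []):
            new_dist = dist + weight
            if new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(heap, (new_dist, neighbour))
    return float("inf")


def rank_senses(graph, candidate_synsets, context_synsets):
    """Order candidate synsets by total distance to the context synsets."""
    totals = {
        c: sum(semantic_distance(graph, c, x) for x in context_synsets)
        for c in candidate_synsets
    }
    return sorted(candidate_synsets, key=totals.get)


# Hypothetical toy network; names and weights are not taken from WordNet.
graph = build_graph([
    ("bank#finance", "financial_institution", 1.0),
    ("financial_institution", "money", 1.0),
    ("bank#river", "slope", 1.0),
    ("slope", "river", 1.0),
    ("financial_institution", "entity", 2.0),
    ("slope", "entity", 2.0),
])
ranking = rank_senses(graph, ["bank#river", "bank#finance"], ["money"])
```

Disambiguating several words at once multiplies the number of sense combinations to score, which suggests why the number of simultaneously disambiguated words had to be capped.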

Analyses of WSD & IR

Krovetz & Croft: sense mismatches were significantly more likely to occur in non-relevant documents.
  word collocation
  skewed frequency distribution

Situations under which WSD may prove useful:
  where collocation is less prevalent
  where query words were used in a minority sense

Analyses of WSD & IR

Sanderson (1994, 1997):

pseudo-words: banana/kalashnikov/anecdote

experiments on the factor of query length: the effectiveness of retrieval based on short queries was greatly affected by the introduction of ambiguity, but much less so for longer queries.
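Pseudo-words introduce artificial ambiguity by conflating unrelated words into one token, so the "correct sense" of each occurrence is known by construction. A minimal sketch, with the function name and the token-list representation as assumptions:

```python
def add_pseudo_words(tokens, groups):
    """Replace every member of a group with the group's pseudo-word."""
    mapping = {word: "/".join(group) for group in groups for word in group}
    return [mapping.get(token, token) for token in tokens]


groups = [("banana", "kalashnikov", "anecdote")]
ambiguous = add_pseudo_words(["a", "banana", "and", "an", "anecdote"], groups)
```

Applying the same mapping to documents and queries lets retrieval effectiveness be measured before and after the ambiguity is introduced.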

Analyses of WSD & IR

Gonzalo et al. (1998): experiments based on SEMCOR; a summary was written for each document and used as a query, so each query has exactly one relevant document.

Cause of error: a sense may be too specific, e.g. newspaper as a business concern as opposed to the physical object

Gonzalo et al. (1998):

synset-based representation: retrieval based on synsets seems to be the best

erroneous disambiguation and its impact on retrieval effectiveness:
  baseline precision: 52.6%
  when error 30%, precision 54.4%
  when error 60%, precision 49.1%

Sanderson (1997):

output word sense in a list ranked by a confidence score

retrieval effectiveness: worse than retrieval without sense tags, better than retrieval with each word tagged with a single sense.

possible cause: errors.

Disambiguation without sense definition

Zernik (1991):

generate clusters for an ambiguous word by three criteria: context words, grammatical category and derivational morphology.

associate each cluster with a dictionary sense.

e.g.
  “train”: 95% accuracy, thanks to grammatical category
  “office”: full of errors

Disambiguation without sense definition

Schütze and Pedersen (1995):

One of the very few results that show an improvement (14%)

Clusters are based on context words only: words with similar contexts are put into the same cluster, but a cluster is recognized only if the context appears more than fifty times in the corpus

Similar contexts of “ball”: tennis, football, cricket. Thus this method breaks up a word’s commonest sense into a number of uses (the sporting sense of ball).

Schütze and Pedersen (1995):

score each use of a word

representing a word occurrence by:
  just the word
  the word with its commonest use
  the word with n of its uses

WSD in IR Revisited (SIGIR ’03)

Skewed frequency distributions coupled with the query term co-occurrence effect are the reasons why traditional IR techniques that don’t take sense into account are not penalized severely.

Inaccurate fine-grained WSD has an extremely negative effect on the performance of an IR system.

To achieve increases in performance, it is imperative to minimize the impact of the inaccurate disambiguation.

The need for 90% accurate disambiguation in order to see performance increases remains questionable.

The WSD methods applied

A number of experiments were tried, but nothing better than the following was found: applying each knowledge source (collocations, co-occurrence, and sense frequency) in a stepwise fashion:

a context window consisting of the sentence surrounding the target word is used to identify the sense of the word

examine the surrounding sentence to see whether it contains any collocates observed in Semcor

specific sense data
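The stepwise use of the three knowledge sources can be sketched as a fallback chain. The data structures, the toy statistics and the exact order of fallback are assumptions for illustration; in the described system the statistics would come from Semcor.

```python
def stepwise_wsd(word, sentence_tokens, collocations, cooccurrence, sense_frequency):
    """Return (sense, knowledge_source) for `word` in its sentence."""
    context = [t for t in sentence_tokens if t != word]

    # Step 1: a known collocate in the sentence decides the sense outright.
    for token in context:
        if (word, token) in collocations:
            return collocations[(word, token)], "collocation"

    # Step 2: otherwise let co-occurring context words vote for senses.
    votes = {}
    for token in context:
        for sense, count in cooccurrence.get(word, {}).get(token, {}).items():
            votes[sense] = votes.get(sense, 0) + count
    if votes:
        return max(votes, key=votes.get), "co-occurrence"

    # Step 3: fall back to the most frequent sense of the word.
    frequencies = sense_frequency.get(word)
    if frequencies:
        return max(frequencies, key=frequencies.get), "sense frequency"
    return None, "unknown"


# Hypothetical toy statistics, not real Semcor counts.
collocations = {("bank", "river"): "bank#shore"}
cooccurrence = {"bank": {"loan": {"bank#finance": 4}, "water": {"bank#shore": 2}}}
sense_frequency = {"bank": {"bank#finance": 20, "bank#shore": 5}}

result = stepwise_wsd(
    "bank", ["the", "bank", "approved", "the", "loan"],
    collocations, cooccurrence, sense_frequency,
)
```

Ordering the sources from most to least specific means the less reliable evidence is only consulted when the more precise evidence is absent.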

WSD in IR Revisited: Conclusions

Reasons for success:

high precision WSD technique

sense frequency statistics

Resilience of the vector space model

Analysis of Schütze and Pedersen’s success: added tolerance

“A highly accurate bootstrapping algorithm for word sense disambiguation” Rada M. 2000

Disambiguate all nouns and verbs:
  step 1: complex nominals
  step 2: named entities
  step 3: word pairs, based on SEMCOR: (previous word, word) pair, (word, successive word) pair
  step 4: context, based on SEMCOR and WordNet; in WordNet, a word’s hypernyms are also part of its context

“A highly accurate bootstrapping algorithm for word sense disambiguation” (cont’d)

step 5: words with semantic distance 0 from a word that has already been disambiguated

step 6: words with semantic distance 1 from a word that has already been disambiguated

step 7: words with semantic distance 0 among ambiguous words

step 8: words with semantic distance 1 among ambiguous words

“An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases” (SIGIR ’04)

Significant increase for short queries

WSD only on the query, and query expansion

Phrase-based and term-based

Pseudo-relevance feedback

Phrases identification

4 types of phrases: proper names (named entities), dictionary phrases (from WordNet), simple phrases, complex phrases

Decide the window size of simple/complex phrases by calculating correlation

Correlation

WSD

Unlike Rada Mihalcea’s WSD, Liu did not use Semcor, only WordNet

6 steps; basic ideas: use hypernyms, hyponyms, cross-references, etc.

Query Expansion

Add synonyms (conditional)

Add definition words (only the first shortest noun phrase), conditional on it being highly globally correlated

Add hyponyms (conditional)

Add compound words (conditional)

PSEUDO RELEVANCE FEEDBACK

Using Global Correlations and Wordnet

Global_cor > 1 and one of the following conditions:
  1. it is monosense
  2. its definition contains some other query terms
  3. it is in the top 10 ranked documents

Combining Local and Global Correlations:

Results

SO: standard Okapi (term-similarity)
NO: enhanced SO
NO+P: + phrase-similarity
NO+P+D: + WSD
NO+P+D+F: + pseudo-feedback

Results:

Model conclusion

WSD on the query only

WSD only by WordNet, no Semcor

Complex query expansion

Pseudo-relevance feedback

Phrase-based and term-based

Thank you!