comp791a: statistical language processing


Page 1: COMP791A: Statistical Language Processing

1

COMP791A: Statistical Language Processing

Word Sense Disambiguation, Chap. 7

Page 2: COMP791A: Statistical Language Processing

2

Overview of the problem

Many words have several meanings or senses (homonyms or polysemous words)
  Ex: “chair” --> furniture or person
  Ex: “dishes” --> plates or food

Need to determine which sense of a word is used in a specific sentence

Note:
  often, the different senses of a word are closely related
    Ex: “title” --> right of legal ownership, document that is evidence of the legal ownership, name of a work, …
  often, several senses can be “activated” in a single context (co-activation)
    Ex: “This could bring competition to the trade”
    competition --> the act of competing AND the people who are competing

Page 3: COMP791A: Statistical Language Processing

3

Word Sense Disambiguation (WSD)

To determine which of the senses of an ambiguous word is invoked in a particular use of the word.

Solving this problem is potentially extremely useful
  Ex: in machine translation…
    “chair” --> (person) “directeur”
    “chair” --> (furniture) “chaise”
    “bureau” --> “desk”
    “bureau” --> “office”

Can be done:
  with rule-based methods
  with statistical methods

Page 4: COMP791A: Statistical Language Processing

4

WordNet

most widely-used lexical database for English
free!
G. Miller at Princeton (www.cogsci.princeton.edu/~wn)
used in many applications of NLP
EuroWordNet: Dutch, Italian, Spanish, German, French, Czech and Estonian
includes entries for open-class words only (nouns, verbs, adjectives & adverbs)

Page 5: COMP791A: Statistical Language Processing

5

WordNet

Entries in WordNet 1.6 (now 2.0):
  118,000 different word forms
  organized according to their meanings (senses)
  each entry has
    a dictionary-style definition (gloss) of each sense
    AND a set of domain-independent lexical relations among
      WordNet’s entries (words)
      senses
      sets of synonyms
  grouped into synsets (i.e. sets of synonyms)

Page 6: COMP791A: Statistical Language Processing

6

Example 1: WordNet entry for verb serve

Page 7: COMP791A: Statistical Language Processing

7

Rule-based WSD

They served green-lipped mussels from New Zealand.
Which airlines serve Denver?

semantic restrictions on the predicate of an argument
  argument mussels --> needs a predicate with the sense {provide-food} --> sense 6 of WordNet
  argument Denver --> needs a predicate with the sense {attend-to} --> sense 10 of WordNet

Page 8: COMP791A: Statistical Language Processing

8

Example 2: WordNet entry for dish

Page 9: COMP791A: Statistical Language Processing

9

Rule-based WSD

In our house, everybody has a career and none of them includes washing dishes.
In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig’s ears and chicken livers with green peppers.

semantic restrictions on the argument of a predicate
  predicate wash --> needs an argument with the sense {object} --> senses 1, 2 or 6 from WordNet
  predicate stir-fry --> needs an argument with the sense {food} --> sense 2 of WordNet

Page 10: COMP791A: Statistical Language Processing

10

Problem with rule-based WSD

In some cases, the constraints on the predicate and on the argument are not enough to pinpoint one unique sense
  Ex: “What kind of dishes do you recommend?”

Figures of speech
  the meaning of words can be generated dynamically instead of being fixed and stored in a lexicon or set of selectional restrictions
  Ex: metaphor, metonymy

Page 11: COMP791A: Statistical Language Processing

11

Problem with rule-based WSD (con’t)

Metaphor:
  using words / phrases whose meanings are appropriate to different kinds of concepts, suggesting a likeness or analogy between them

  This deal does not scare Microsoft.
    scare has 2 senses in WordNet:
      to cause fear
      to cause to lose courage
    metaphor: the corporation is viewed as a person

  She is drowning in money.
    metaphor: money is viewed as a liquid

Page 12: COMP791A: Statistical Language Processing

12

Problem with rule-based WSD (con’t)

Metonymy:
  referring to a concept by naming some other concept closely related to it

  We await word from the crown.
    a monarch is not the same thing as a crown
    but we often refer to the monarch as "the crown" because the two are associated
    metonymy: the crown refers to the monarch

  The White House had no comment.
    metonymy: The White House refers to the administration

Page 13: COMP791A: Statistical Language Processing

13

WSD versus POS tagging

“butter” can be a verb or a noun
  “I should butter my toast.”
  “I like butter on my toast.”
2 different POS --> 2 different usages with 2 different meanings
So WSD can be viewed as POS tagging (classifying using semantic tags rather than POS tags)
But the 2 tasks are considered different… because:
  nearby structural cues (ex: is the previous word a determiner?)
    are important in POS tagging
    are not effective for WSD
  distant content words
    are very effective for WSD
    are not interesting for POS tagging
So:
  in POS tagging, we typically only look at the local context
  in WSD, we use content words in a larger context

Page 14: COMP791A: Statistical Language Processing

14

Approaches to Statistical WSD

Supervised Disambiguation
  based on a labeled training set
  The learning system has: a training set of feature-encoded inputs AND their appropriate sense label (category)
Based on Lexical Resources
  use of external lexical resources such as dictionaries and thesauri
Discourse properties
Unsupervised Disambiguation
  based on unlabeled corpora
  The learning system has: a training set of feature-encoded inputs BUT NOT their appropriate sense label (category)

Page 15: COMP791A: Statistical Language Processing

15

Approaches to Statistical WSD

--> Supervised Disambiguation
      Naïve Bayes
      Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

Page 16: COMP791A: Statistical Language Processing

16

Supervised WSD: Overview

A word is assumed to have a finite number of discrete senses.
  ex: bass = fish, musical instrument, ...
The sense of a word depends on the sense of surrounding words.

Surrounding words     Most probable sense
…river…               fish
…violin…              instrument
…salmon…              fish
play/Verb + bass      instrument
bass + player         instrument
…striped…             fish

Page 17: COMP791A: Statistical Language Processing

17

Supervised WSD: Overview (con’t)

WSD is viewed as a typical classification problem
  use machine learning techniques to train a system that learns a classifier (a function f) to assign to unseen examples one of a fixed number of senses (categories)
  f(input) = correct sense

Input:
  Target word: the word to be disambiguated
  Context (feature vector): a vector of relevant linguistic features that represents its context (ex: a window of words around the target word)

Page 18: COMP791A: Statistical Language Processing

18

Examples of Feature Vectors

Take a window of n words around the target word
Encode information about the words around the target word
  typical features include: words, root forms, POS tags, frequency, …

An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.

with position information
  [ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]
no position information, but word frequency
  vocabulary: [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
  vector:     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
other features
  [followed by "player", contains "show" in the sentence, …]
  [yes, no, …]
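The two encodings on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`tokenize`, `window_features`, `frequency_vector`) are made up for this sketch, tokenization is a naive lowercase split, and the POS tags from the slide are left out because they would require a tagger.

```python
import re

# Vocabulary taken from the slide.
VOCABULARY = ["fishing", "big", "sound", "player", "fly", "rod",
              "pound", "double", "runs", "playing", "guitar", "band"]

SENTENCE = ("An electric guitar and bass player stand off to one side, "
            "not really part of the scene, just as a sort of nod to "
            "gringo expectations perhaps.")


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def window_features(tokens, target, n=2):
    """Words at positions -n..-1 and +1..+n around the first occurrence of target."""
    i = tokens.index(target)
    return tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]


def frequency_vector(tokens, vocabulary):
    """Bag-of-words encoding: one count per vocabulary word, position ignored."""
    return [tokens.count(word) for word in vocabulary]


tokens = tokenize(SENTENCE)
window = window_features(tokens, "bass", n=2)
vector = frequency_vector(tokens, VOCABULARY)
print(window)
print(vector)
```

The positional window keeps word order around the target, while the frequency vector discards it; which one works better depends on the classifier that consumes the features.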

Page 19: COMP791A: Statistical Language Processing

19

Supervised WSD

Training corpus
  Each occurrence of the ambiguous word w is annotated with a semantic label (its contextually appropriate sense sk).

Several approaches from ML
  Bayesian classification
  Decision trees
  Neural networks
  K-nearest neighbor (kNN)
  …

Page 20: COMP791A: Statistical Language Processing

20

Approaches to Statistical WSD

--> Supervised Disambiguation
      --> Naïve Bayes
      Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

Page 21: COMP791A: Statistical Language Processing

21

Naïve Bayes Classification

Goal: choose the most probable sense s* for a word given a vector V of surrounding words
  vector contains: frequency of words
    vocabulary: [fishing, big, sound, player, fly, rod, …]
    V:          [0, 0, 0, 2, 1, 0, …]

Bayes decision rule:
  $s^* = \arg\max_{s_k} P(s_k \mid V)$
  where:
    S is the set of possible senses for the target word
    sk is a sense in S
    V is the feature vector (the representation of the context)

Using Bayes rule:
  $s^* = \arg\max_{s_k} \frac{P(V \mid s_k)\, P(s_k)}{P(V)}$

Page 22: COMP791A: Statistical Language Processing

22

Decision Rule for Naive Bayes

Goal:
  $s^* = \arg\max_{s_k} P(s_k \mid V) = \arg\max_{s_k} \frac{P(V \mid s_k)\, P(s_k)}{P(V)}$   // Bayes rule

But:
  $P(V \mid s_k) = \prod_{j=1}^{n} P(v_j \mid s_k)$   // independence assumption: the features (context) are conditionally independent

So:
  $s^* = \arg\max_{s_k} \frac{P(s_k) \prod_{j=1}^{n} P(v_j \mid s_k)}{P(V)}$

But: P(V) is the same for all possible senses, so it does not affect the final ranking of the senses, so we can drop it:
  $s^* = \arg\max_{s_k} P(s_k) \prod_{j=1}^{n} P(v_j \mid s_k)$

To make the computations simpler, we often take the log of probabilities:
  $s^* = \arg\max_{s_k} \log\left[ P(s_k) \prod_{j=1}^{n} P(v_j \mid s_k) \right] = \arg\max_{s_k} \left[ \log P(s_k) + \sum_{j=1}^{n} \log P(v_j \mid s_k) \right]$

Page 23: COMP791A: Statistical Language Processing

23

Naïve Bayes WSD

  $s^* = \arg\max_{s_k} \left[ \log P(s_k) + \sum_{j=1}^{n} \log P(v_j \mid s_k) \right]$

Training a Naïve Bayes classifier
  = estimating P(vj|sk) and P(sk) from a sense-tagged training corpus
  = finding the Maximum-Likelihood Estimates, perhaps with appropriate smoothing

  $P(v_j \mid s_k) = \frac{\mathrm{count}(v_j, s_k)}{\sum_{t} \mathrm{count}(v_t, s_k)}$
    nb of occurrences of feature j over the total nb of features appearing in windows of sk

  $P(s_k) = \frac{\mathrm{count}(s_k)}{\mathrm{count}(word)}$
    nb of occurrences of sense k over nb of all occurrences of the ambiguous word

Page 24: COMP791A: Statistical Language Processing

24

Naïve Bayes Algorithm

// 1. training
for all senses sk of word w
    for all words vj in the vocabulary
        compute $P(v_j \mid s_k) = \frac{\mathrm{count}(v_j, s_k)}{\sum_{t} \mathrm{count}(v_t, s_k)}$
for all senses sk of word w
    compute $P(s_k) = \frac{\mathrm{count}(s_k)}{\mathrm{count}(word)}$

// 2. disambiguation
for all senses sk of word w
    score(sk) = log P(sk)
    for all words vj in the context window
        score(sk) = score(sk) + log P(vj | sk)
choose s* = the sense sk with the greatest score(sk)

Page 25: COMP791A: Statistical Language Processing

25

Example

Training corpus (context window = 3 words):
  …Today the World Bank/BANK1 and partners are calling for greater relief…
  …Welcome to the Bank/BANK1 of America the nation's leading financial institution…
  …Welcome to America's Job Bank/BANK1 Visit our site and…
  …Web site of the European Central Bank/BANK1 located in Frankfurt…
  …The Asian Development Bank/BANK1 ADB a multilateral development finance…

  …lounging against verdant banks/BANK2 carving out the...
  …for swimming, had warned her off the banks/BANK2 of the Potomac. Nobody...

Training:
  P(the|BANK1) = 5/30        P(the|BANK2) = 3/12
  P(world|BANK1) = 1/30      P(world|BANK2) = 0/12
  P(and|BANK1) = 1/30        P(and|BANK2) = 0/12
  …
  P(off|BANK1) = 0/30        P(off|BANK2) = 1/12
  P(Potomac|BANK1) = 0/30    P(Potomac|BANK2) = 1/12

  P(BANK1) = 5/7             P(BANK2) = 2/7

Disambiguation: “I lost my left shoe on the banks of the river Nile.”
  Score(BANK1) = log(5/7) + log(P(shoe|BANK1)) + log(P(on|BANK1)) + log(P(the|BANK1)) + …
  Score(BANK2) = log(2/7) + log(P(shoe|BANK2)) + log(P(on|BANK2)) + log(P(the|BANK2)) + …
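The training and disambiguation steps for this example can be sketched in Python as follows. The context windows are copied from the corpus above (3 words on each side, lowercased). One assumption goes beyond the slide: because words such as "shoe" never occur in training, the raw estimates would give log(0), so the scoring step uses add-one smoothing with one extra vocabulary slot for unknown words. The slide only says "perhaps with appropriate smoothing", so this particular choice is illustrative.

```python
import math
from collections import Counter, defaultdict

# (sense, context window) pairs: 3 words left and 3 words right of the target.
TRAINING = [
    ("BANK1", "today the world and partners are"),
    ("BANK1", "welcome to the of america the"),
    ("BANK1", "to america's job visit our site"),
    ("BANK1", "the european central located in frankfurt"),
    ("BANK1", "the asian development adb a multilateral"),
    ("BANK2", "lounging against verdant carving out the"),
    ("BANK2", "her off the of the potomac"),
]


def train(examples):
    """Count senses and, per sense, the words seen in its context windows."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    for sense, context in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(context.split())
    return sense_counts, word_counts


def mle(word, sense, word_counts):
    """Unsmoothed maximum-likelihood estimate of P(word | sense)."""
    return word_counts[sense][word] / sum(word_counts[sense].values())


def disambiguate(context, sense_counts, word_counts):
    """Return the best sense and all log scores, using add-one smoothing."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    v = len(vocabulary) + 1  # +1 slot for unknown words (an assumption)
    total = sum(sense_counts.values())
    scores = {}
    for sense in sense_counts:
        n = sum(word_counts[sense].values())
        score = math.log(sense_counts[sense] / total)
        for word in context.split():
            score += math.log((word_counts[sense][word] + 1) / (n + v))
        scores[sense] = score
    return max(scores, key=scores.get), scores


sense_counts, word_counts = train(TRAINING)
# Window around "banks" in "I lost my left shoe on the banks of the river Nile."
best, scores = disambiguate("shoe on the of the river", sense_counts, word_counts)
print(best, scores)
```

With these seven training windows the smoothed scores favour BANK2 for the test sentence, mostly because "the" and "of" make up a larger share of the much smaller BANK2 sample.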

Page 26: COMP791A: Statistical Language Processing

26

Naïve Bayes Assumption

Independence assumption:
  The features (contextual words) are conditionally independent:
    $P(V \mid s_k) = \prod_{j=1}^{n} P(v_j \mid s_k)$
  The probability of an entire feature vector given a sense is the product of the probabilities of its individual features given that sense

Consequences: bag-of-words model
  the structure and linear ordering of words within the context is ignored
  the presence of one word in the bag is independent of another

The independence assumption is incorrect but is useful in WSD
  (Gale, Church & Yarowsky, 1992) report 90% correct disambiguation with 6 ambiguous nouns in the Hansard

Page 27: COMP791A: Statistical Language Processing

27

Approaches to Statistical WSD

--> Supervised Disambiguation
      Naïve Bayes
      --> Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

Page 28: COMP791A: Statistical Language Processing

28

Decision Tree Classifier

Bayes Classifier uses information from all words in the context window

But some words are more reliable than others to indicate which sense is used…

Page 29: COMP791A: Statistical Language Processing

29

Decision Tree Classifier (con’t)

Look for features that are very good indicators of the result
Place these features (as questions) in nodes of a decision tree
Split the examples so that those with different values for the chosen feature are in a different set
Repeat the same process with another feature

A sequence of tests is applied to each feature vector
  if a test succeeds --> return the sense associated with the test
  otherwise --> apply the next test
  if all features have been tested, then return a default sense (most common one)

Page 30: COMP791A: Statistical Language Processing

30

Example: bass

Observation   Includes “fish”?   “striped bass”?   Includes “guitar”?   “bass player”?   Includes “piano”?   Sense
1             Yes                Yes               No                   No               No                  fish
2             Yes                Yes               No                   No               No                  fish
3             No                 No                Yes                  No               No                  instrument
4             No                 Yes               No                   No               No                  fish
5             Yes                Yes               No                   No               No                  fish
6             No                 No                Yes                  Yes              Yes                 instrument
7             No                 Yes               No                   No               No                  fish

is "fish" in the feature vector?
  yes --> fish
  no  --> is "striped" the previous word?
            yes --> fish
            no  --> is "guitar" in the feature vector?
                      yes --> instrument
                      no  --> fish
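The tree for "bass" is a chain of yes/no tests, so it can be sketched as an ordered list of (feature, sense) pairs with a default. The feature names below are hypothetical labels for the table columns, and the default sense "fish" is assumed to be the most common sense among the seven observations.

```python
# Ordered tests: (feature name, sense returned when the feature is present).
TESTS = [
    ("includes_fish", "fish"),
    ("striped_bass", "fish"),
    ("includes_guitar", "instrument"),
]
DEFAULT_SENSE = "fish"  # most common sense in the training observations

COLUMNS = ["includes_fish", "striped_bass", "includes_guitar",
           "bass_player", "includes_piano"]

# The seven observations from the table, encoded as Y/N flags per column.
OBSERVATIONS = [
    ("YYNNN", "fish"),
    ("YYNNN", "fish"),
    ("NNYNN", "instrument"),
    ("NYNNN", "fish"),
    ("YYNNN", "fish"),
    ("NNYYY", "instrument"),
    ("NYNNN", "fish"),
]


def decode(flags):
    """Turn a 'YNNYN' style string into a feature dictionary."""
    return {name: flag == "Y" for name, flag in zip(COLUMNS, flags)}


def classify(features):
    """Apply the tests in order; fall back to the default sense."""
    for name, sense in TESTS:
        if features.get(name, False):
            return sense
    return DEFAULT_SENSE


predictions = [classify(decode(flags)) for flags, _ in OBSERVATIONS]
gold = [sense for _, sense in OBSERVATIONS]
print(predictions == gold)
```

On the seven training observations the chain of tests reproduces every label, which is expected since the tree was built from them.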

Page 31: COMP791A: Statistical Language Processing

31

Another Example: The restaurant

[Table: training data, with Input and Output columns]

Page 32: COMP791A: Statistical Language Processing

32

A first decision tree

But is it the best decision tree we can build?

Page 33: COMP791A: Statistical Language Processing

33

A better decision tree

4 tests instead of 9 & 11 branches instead of 21

Page 34: COMP791A: Statistical Language Processing

34

Choosing the best feature

The key problem is choosing which feature to use to split a given set of examples

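Most used strategy: information theory

Entropy (or self-information):
  $H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$

  $H(\text{fair coin toss}) = I\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\sum_{x_i \in X} p(x_i) \log_2 p(x_i) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}$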

Page 35: COMP791A: Statistical Language Processing

35

Choosing the best feature (con't)

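The "discriminating power" of an attribute A given a set S

if the training set contains p positive examples and n negative examples:

  $\text{entropy}(Set) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$

  $\text{entropy}(Set \mid A) = \sum_{i=1}^{v} \frac{p_i+n_i}{p+n} \times I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$

  $\text{gain}(Set, A) = \text{entropy}(Set) - \text{entropy}(Set \mid A)$

where:
  v is the number of distinct values that attribute A can take
  p_i and n_i are the numbers of positive and negative examples that have the i-th value of A
  I(a, b) is the entropy for an attribute with prob. of success a and prob. of failure b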

Page 36: COMP791A: Statistical Language Processing

36

Some intuition

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Size is the least discriminating attribute (i.e. smallest information gain)

Shape and color are the most discriminating attributes (i.e. highest information gain)

Page 37: COMP791A: Statistical Language Processing

37

A small example

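Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Size            Color           Shape
big:   1+ 1-    red:  2+ 1-     circle: 2+ 1-
small: 1+ 1-    blue: 0+ 1-     square: 0+ 1-

Note: by definition 0 log 0 is 0

Color:
  $\text{entropy}(Output) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$
  $\text{entropy}(Output \mid red) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) = 0.918$
  $\text{entropy}(Output \mid blue) = -\left(\tfrac{1}{1}\log_2\tfrac{1}{1}\right) = 0$
  $\text{entropy}(Output \mid Color) = \tfrac{3}{4}(0.918) + \tfrac{1}{4}(0) = 0.6885$
  $\text{gain}(Output \mid Color) = 1 - 0.6885 = 0.3115$

Shape:
  $\text{entropy}(Output) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$
  $\text{entropy}(Output \mid Shape) = \tfrac{3}{4}(0.918) + \tfrac{1}{4}(0) = 0.6885$
  $\text{gain}(Output \mid Shape) = 1 - 0.6885 = 0.3115$

Size:
  $\text{entropy}(Output) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$
  $\text{entropy}(Output \mid Size) = \tfrac{1}{2}(1) + \tfrac{1}{2}(1) = 1$
  $\text{gain}(Output \mid Size) = 1 - 1 = 0$

So first separate according to either color or shape (root of the tree)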
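The gains for the four-example table can be checked with a short Python sketch. The function names `entropy` and `gain` are chosen for this illustration; the data is the Size / Color / Shape table from the slide. The code works at full floating-point precision, so it gives about 0.311 where the slide, which rounds 0.918 before multiplying, gives 0.3115.

```python
import math
from collections import Counter

ROWS = [
    {"Size": "Big",   "Color": "Red",  "Shape": "Circle", "Output": "+"},
    {"Size": "Small", "Color": "Red",  "Shape": "Circle", "Output": "+"},
    {"Size": "Small", "Color": "Red",  "Shape": "Square", "Output": "-"},
    {"Size": "Big",   "Color": "Blue", "Shape": "Circle", "Output": "-"},
]


def entropy(labels):
    """Entropy in bits; only observed labels are counted, so 0 log 0 never arises."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain(rows, attribute, target="Output"):
    """Information gain: entropy(Set) minus the weighted entropy after the split."""
    labels = [row[target] for row in rows]
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {attribute: gain(ROWS, attribute) for attribute in ("Size", "Color", "Shape")}
print(gains)
```

Color and Shape tie for the highest gain and Size gains nothing, which matches the conclusion on the slide.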

Page 38: COMP791A: Statistical Language Processing

38

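The restaurant example

With the restaurant training data shown earlier, we have:

  $\text{gain}(Output \mid Patrons) = 1 - \left[\frac{2}{12} \times I\left(\frac{0}{2},\frac{2}{2}\right) + \frac{4}{12} \times I\left(\frac{0}{4},\frac{4}{4}\right) + \frac{6}{12} \times I\left(\frac{2}{6},\frac{4}{6}\right)\right]$
  $= 1 - \left[\frac{2}{12} \times -\left(\frac{0}{2}\log_2\frac{0}{2} + \frac{2}{2}\log_2\frac{2}{2}\right) + \frac{4}{12} \times -\left(\frac{0}{4}\log_2\frac{0}{4} + \frac{4}{4}\log_2\frac{4}{4}\right) + \dots\right] = 0.541 \text{ bits}$

  $\text{gain}(Output \mid Type) = 1 - \left[\frac{2}{12} \times I\left(\frac{1}{2},\frac{1}{2}\right) + \frac{2}{12} \times I\left(\frac{1}{2},\frac{1}{2}\right) + \frac{4}{12} \times I\left(\frac{2}{4},\frac{2}{4}\right) + \frac{4}{12} \times I\left(\frac{2}{4},\frac{2}{4}\right)\right] = 0 \text{ bits}$

So the root of the tree should be attribute Patrons (we gain more information)

do recursively for subtrees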

Page 39: COMP791A: Statistical Language Processing

39

Back to WSD

Need to translate the French word “Prendre”: can be seen as WSD
possible translations/senses = {take, make, rise, speak}

              Features/Attributes
Observation   Tense   Word left   Direct object   Word right   …   Sense
1             …       …           mesure          …            …   take
2             …       …           note            …            …   take
3             …       …           exemple         …            …   take
4             …       …           decision        …            …   make
5             …       …           parole          …            …   speak
6             …       …           parole          …            …   rise

Page 40: COMP791A: Statistical Language Processing

40

Back to WSD (con't)

(Brown et al., 1991) found, on the Canadian Hansard:

Ambiguous word   Possible senses / translations        Best feature       Example
“Prendre”        {“take”, “make”, “rise”, “speak”}     Direct object      “Prendre une mesure” --> “to take”; “Prendre une décision” --> “to make”
“Vouloir”        {“to want”, “to like”}                Tense              Present --> “to want”; Conditional --> “to like”
“Cent”           {“%”, “¢”}                            Word to the left   “Pour” --> “%”; Number --> “¢”

Page 41: COMP791A: Statistical Language Processing

41

Training Set

With supervised methods, we need a large sense-tagged training set… where do you get it from?

Using a "real" training set
  Main standard hand sense-tagged corpora:
    SEMCOR corpus
      portion of the Brown corpus tagged with WordNet senses
    SENSEVAL corpus (www.senseval.org/)
      Standard WSD “competition” like MUC, TREC & DUC
    Open Mind Word Expert (OMWE)

Using pseudowords
  Artificial ambiguous words created by conflating two or more words.
  Ex: occurrences of “banana” and “door” can be replaced by “banana-door”
  The disambiguation algorithm can now be tested on this data to disambiguate the pseudoword “banana-door” into either “banana” or “door”
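Building a pseudoword test set amounts to a search-and-replace that remembers which word was replaced. The sketch below shows one way to do it; the function name `make_pseudoword_corpus` and the two sample sentences are made up for illustration.

```python
import re


def make_pseudoword_corpus(sentences, words):
    """Replace each occurrence of any word in `words` by their conflation.

    Returns (sentence with pseudoword, original word) pairs; the original
    word is the gold label the disambiguation algorithm must recover.
    """
    pseudoword = "-".join(words)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    examples = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            replaced = sentence[:match.start()] + pseudoword + sentence[match.end():]
            examples.append((replaced, match.group(1).lower()))
    return examples


SENTENCES = ["She peeled a banana.", "He closed the door."]
examples = make_pseudoword_corpus(SENTENCES, ["banana", "door"])
for text, gold in examples:
    print(gold, "<-", text)
```

Because the gold label comes for free, a corpus of any size can be produced without hand tagging, at the cost of the ambiguity being artificial.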

Page 42: COMP791A: Statistical Language Processing

42

Problems…

With supervised (or unsupervised) methods:
  need a large amount of work to create a classifier for each ambiguous word!
  So most work based on these techniques reports results on only a few words (2 to 12 words)
  Scaling up these approaches to deal with all ambiguous words is immense work!

Solution:
  use lexical resources (ex: machine-readable dictionaries)
  use distributional properties to improve disambiguation:
    an ambiguous word tends to be used in only one sense in any given discourse and with any given collocate.

Page 43: COMP791A: Statistical Language Processing

43

Approaches to Statistical WSD

    Supervised Disambiguation
      Naïve Bayes
      Decision-tree
--> Use of Lexical Resources
      --> Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

Page 44: COMP791A: Statistical Language Processing

44

WSD based on sense definitions

(Lesk, 1986)
A word’s dictionary definitions are likely to be good indicators for the sense they define.

Method:
  Express the dictionary definitions of the ambiguous word as a set of bags-of-words (one bag per definition)
  Express the context of the ambiguous word as a single bag-of-words built from the dictionary definitions of the context words.
  Choose the definition of the ambiguous word that has the greatest overlap with the words occurring in its context.

Page 45: COMP791A: Statistical Language Processing

45

Example

"Cone" in dictionary:
  DEF-1: “solid body which narrows to a point”
    BAG = {body, narrows, point, solid}
  DEF-2: “something of this shape whether solid or hollow”
    BAG = {hollow, shape, something, solid}
  DEF-3: “fruit of certain evergreen tree”
    BAG = {evergreen, fruit, tree}

To disambiguate "cone" in "pine cone"
"Pine" in dictionary:
  DEF-1: “kind of evergreen tree”
  DEF-2: “waste away through sorrow or illness”
  --> BAG = {evergreen, illness, kind, sorrow, tree, waste}

so "cone" is:
  score(DEF-1) = |{body, narrows, point, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 0
  score(DEF-2) = |{hollow, shape, something, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 0
  score(DEF-3) = |{evergreen, fruit, tree} ∩ {evergreen, illness, kind, sorrow, tree, waste}| = 2

Max overlap: DEF-3
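The "pine cone" computation can be reproduced with set intersections. In the sketch below the stopword list is an assumption, chosen only so that the bags come out identical to the ones on the slide; a real system would use a standard stopword list and probably stemming.

```python
# Hypothetical stopword list, picked to reproduce the slide's bags exactly.
STOPWORDS = {"a", "of", "to", "or", "which", "this", "whether",
             "certain", "away", "through"}

CONE_DEFINITIONS = {
    "DEF-1": "solid body which narrows to a point",
    "DEF-2": "something of this shape whether solid or hollow",
    "DEF-3": "fruit of certain evergreen tree",
}

PINE_DEFINITIONS = [
    "kind of evergreen tree",
    "waste away through sorrow or illness",
]


def bag(text):
    """Bag of content words of a definition (as a set, counts ignored)."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def lesk(target_definitions, context_definitions):
    """Score each sense by its overlap with the context words' definitions."""
    context_bag = set()
    for definition in context_definitions:
        context_bag |= bag(definition)
    scores = {sense: len(bag(text) & context_bag)
              for sense, text in target_definitions.items()}
    return max(scores, key=scores.get), scores


best, scores = lesk(CONE_DEFINITIONS, PINE_DEFINITIONS)
print(best, scores)
```

The overlap is a plain count of shared word types, so short definitions quickly lead to ties at zero.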

Page 46: COMP791A: Statistical Language Processing

46

The algorithm

For all senses sk of word w
    score(sk) = overlap(
        words in the dictionary definition of sense sk,
        the union of the words in the dictionary definitions of all the words in the context window of w
    )
pick the sense s* with the highest score(sk)

Page 47: COMP791A: Statistical Language Processing

47

Analysis

Accuracies of 50-70% on short samples of texts

Problem:
  dictionary entries for the target words are usually relatively short and may not provide sufficient material to create adequate classifiers
  because the words in the context and their definitions must have direct overlap

One solution:
  expand the list of words with those whose definitions make use of the target word
  Example:
    if “deposit” does not occur in the definition of “bank”, but “bank” occurs in the definition of “deposit”
    we can expand the classifier for “bank” to include “deposit” as a relevant feature

However:
  just knowing that “deposit” is related to “bank” does not help much if we do not know which sense of “bank” it is related to
  --> To make use of “deposit” as a feature, we have to know which sense of “bank” was being used in the definition

Solution:
  Use a thesaurus…

Page 48: COMP791A: Statistical Language Processing

48

Approaches to Statistical WSD

    Supervised Disambiguation
      Naïve Bayes
      Decision-tree
--> Use of Lexical Resources
      Dictionary-based
      --> Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

Page 49: COMP791A: Statistical Language Processing

49

Thesaurus-Based Disambiguation

Thesauri include tags (subject codes) in their entries that correspond to broad semantic categories
Each word is assigned one or more subject codes, which correspond to its different meanings
  ANIMAL/INSECT (category 414)
  TOOLS/MACHINERY (category 348)

The semantic categories of the words in a context determine the semantic category of the whole context
This category determines which word senses are used

For each subject code, count the number of words in the context that have the same subject code
Select the subject code that has the highest count

Accuracy ~50% (but with difficult and highly ambiguous words)
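The counting step can be sketched as follows. The miniature thesaurus is entirely hypothetical (a real one would come from a resource such as Roget's), and the target word "crane" is just an illustrative word that plausibly carries both subject codes named on the slide.

```python
from collections import Counter

# Hypothetical toy thesaurus: word -> set of subject codes.
THESAURUS = {
    "crane": {"ANIMAL/INSECT", "TOOLS/MACHINERY"},
    "heron": {"ANIMAL/INSECT"},
    "nest": {"ANIMAL/INSECT"},
    "wings": {"ANIMAL/INSECT"},
    "engine": {"TOOLS/MACHINERY"},
    "lift": {"TOOLS/MACHINERY"},
    "steel": {"TOOLS/MACHINERY"},
}


def disambiguate(target, context_words, thesaurus):
    """Pick the target's subject code shared by the most context words."""
    candidates = thesaurus[target]
    counts = Counter()
    for word in context_words:
        for code in thesaurus.get(word, ()):
            if code in candidates:
                counts[code] += 1
    if not counts:
        return None  # no context word shares a code with the target
    return counts.most_common(1)[0][0]


bird = disambiguate("crane", ["the", "heron", "and", "the", "nest"], THESAURUS)
machine = disambiguate("crane", ["steel", "lift", "engine", "nest"], THESAURUS)
print(bird, machine)
```

Context words missing from the thesaurus contribute nothing, which is one reason coverage matters so much for this method.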

Page 50: COMP791A: Statistical Language Processing

50

Some Results

(Yarowsky, 1992), using Roget categories:

Word       Sense                Roget category   Accuracy
bass       musical instrument   MUSIC            99%
           fish                 ANIMAL,INSECT    100%
star       space object         UNIVERSE         96%
           celebrity            ENTERTAINER      95%
           star-shaped object   INSIGNIA         82%
interest   curiosity            REASONING        88%
           advantage            INJUSTICE        34%
           financial            DEBT             90%
           share                PROPERTY         38%

Page 51: COMP791A: Statistical Language Processing

51

Approaches to Statistical WSD

    Supervised Disambiguation
      Naïve Bayes
      Decision-tree
--> Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      --> Translation-based
    Discourse properties
    Unsupervised Disambiguation

Page 52: COMP791A: Statistical Language Processing

52

Translation-Based WSD

Words can be disambiguated by looking at how they are translated in other languages

Example: the word “interest”

                     sense1                        sense2
Definition           legal share                   attention, concern
German translation   “Beteiligung”                 “Interesse”
English phrase       “acquire an interest”         “show interest”
Translation          “eine Beteiligung erwerben”   “Interesse zeigen”

To disambiguate the word “interest” in “showed interest”
  German translation of “show” is “zeigen”
  In German corpus:
    we always find “Interesse zeigen”
    we never find “Beteiligung zeigen”
  So in the original phrase “showed interest”, interest has sense2

To disambiguate the word “interest” in “acquired an interest”
  German translation of “acquired” is “erwarb”
  In German corpus: C(“erwarb”, “Beteiligung”) > C(“erwarb”, “Interesse”)
  So in the original phrase “acquired an interest”, interest has sense1

Page 53: COMP791A: Statistical Language Processing

53

Approaches to Statistical WSD

    Supervised Disambiguation
      Naïve Bayes
      Decision-tree
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
--> Discourse properties
    Unsupervised Disambiguation

Page 54: COMP791A: Statistical Language Processing

54

Discourse Properties (Yarowsky, 1995)

So far, all methods have considered each occurrence of an ambiguous word separately…
But…
  One sense per discourse
    One document --> one sense
  One sense per collocation
    Select some nearby words that give very strong clues, i.e. select words of a collocation <-> sense of target word

(Yarowsky, 1995) shows a reduction of error rate by 27% when using the discourse constraint!
  i.e. assign the majority sense of the discourse to all occurrences of the target word

we can combine these 2 heuristics
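The discourse constraint is a post-processing step: take the senses a first-pass classifier assigned within one document and overwrite them all with the majority sense. A minimal sketch, where the function name and the sample labels are illustrative:

```python
from collections import Counter


def one_sense_per_discourse(labels):
    """Relabel every occurrence in a document with the document's majority sense."""
    if not labels:
        return []
    majority = Counter(labels).most_common(1)[0][0]
    return [majority] * len(labels)


# First-pass senses for four occurrences of "bass" in one document (made up).
first_pass = ["fish", "fish", "instrument", "fish"]
relabelled = one_sense_per_discourse(first_pass)
print(relabelled)
```

The heuristic can only help when the first pass is right more often than wrong within the document, and it will hurt in the rare documents that really use two senses.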

Page 55: COMP791A: Statistical Language Processing

55

Approaches to Statistical WSD

    Supervised Disambiguation
      Naïve Bayes
      Decision-tree
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
--> Unsupervised Disambiguation

Page 56: COMP791A: Statistical Language Processing

56

Unsupervised Disambiguation

Disambiguate word senses:
  without supporting tools such as dictionaries and thesauri
  without a labeled training text

Without such resources, we cannot really identify/label the senses
  i.e. cannot say bank-1 or bank-2
  we do not even know the different senses of a word!

But we can:
  Cluster/group the contexts of an ambiguous word into a number of groups
  discriminate between these groups without actually labeling them

Page 57: COMP791A: Statistical Language Processing

57

Clustering

Represent each instance of the ambiguous word as a vector <f1, f2, f3, …, fV>
  V is the vocabulary size
  fi is the frequency of word i in the context
each vector can be visually represented in a V-dimensional space

[Figure: context vectors V1, V2 and V3 plotted in a space with axes word1, word2 and word3]

Page 58: COMP791A: Statistical Language Processing

58

Clustering

hypothesis: same senses of words will have similar neighboring words

Disambiguation algorithm
  Identify context vectors corresponding to all occurrences of a particular word
  Partition them into regions of high density
  Tag a sense for each such region
  Disambiguating a word:
    Compute context vector of its occurrence
    Find the closest centroid of a region
    Assign the occurrence the sense of that centroid
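A minimal sketch of this procedure, assuming k-means as the partitioning method (the slide does not name one) and Euclidean distance. The four-word vocabulary, the context vectors and the choice of initial centroids are all made up for illustration; the resulting groups are numbered, not named, in keeping with the unsupervised setting.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda i: distance(vector, centroids[i]))


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain k-means with fixed initial centroids (deterministic)."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid if a group happens to be empty.
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


# Hypothetical vocabulary: [river, fish, guitar, music]
CONTEXTS = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
centroids = kmeans(CONTEXTS, [CONTEXTS[0], CONTEXTS[2]])

# Disambiguate new occurrences by their closest centroid.
group_a = nearest([1, 1, 0, 0], centroids)
group_b = nearest([0, 0, 1, 0], centroids)
print(centroids, group_a, group_b)
```

The output is a group index per occurrence; attaching a sense name to each group still requires a human or an external resource.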

Page 59: COMP791A: Statistical Language Processing

59

Evaluating WSD

Metrics:
  Accuracy: the % of words that are tagged correctly
  Precision & Recall:
    Good: nb of correct answers provided by the system
    Bad: nb of wrong answers provided by the system
    Null: nb of cases in which the system doesn’t provide any answer
  compared to a gold standard
    SEMCOR corpus, SENSEVAL corpus, original text without pseudo-words, …

Difficulty in evaluation:
  Nature of the senses to distinguish has a huge impact on results
  coarse VS fine-grained sense distinction
    ex: “chair” --> person VS furniture
    ex: “bank” --> financial institution VS building
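The slide names the Good, Bad and Null counts but not the formulas. The sketch below assumes the usual definitions, precision = Good / (Good + Bad) and recall = Good / (Good + Bad + Null), and represents a missing answer by `None`; the sample answers are made up.

```python
def evaluate(gold, answers):
    """Compute Good/Bad/Null counts plus precision and recall.

    `answers` holds the system's sense for each instance, or None when the
    system declined to answer.
    """
    good = sum(1 for g, a in zip(gold, answers) if a is not None and a == g)
    bad = sum(1 for g, a in zip(gold, answers) if a is not None and a != g)
    null = sum(1 for a in answers if a is None)
    attempted = good + bad
    precision = good / attempted if attempted else 0.0
    recall = good / (attempted + null) if (attempted + null) else 0.0
    return {"good": good, "bad": bad, "null": null,
            "precision": precision, "recall": recall}


gold = ["person", "furniture", "person", "furniture"]
answers = ["person", "person", None, "furniture"]
result = evaluate(gold, answers)
print(result)
```

When the system answers every instance, Null is zero and precision, recall and accuracy all coincide.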

Page 60: COMP791A: Statistical Language Processing

60

Bounds on Performance

Upper and Lower Bounds on Performance:
  Measure of how well an algorithm performs relative to the difficulty of the task.

Upper Bound: human performance
  Around 97%-99% with few and clearly distinct senses
  Inter-judge agreement:
    With words with clear & distinct senses --> 95% and up
    With polysemous words with related senses --> 65%-70%

Lower Bound (or baseline): usually the assignment of the most frequent sense
  90% is excellent for a word with 2 equiprobable senses
  90% is trivial for a word with 2 senses with probability ratios of 9 to 1 !!!
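The two 90% cases can be made concrete by computing the most-frequent-sense baseline on each distribution. The sense labels below are placeholders.

```python
from collections import Counter


def most_frequent_sense_baseline(sense_labels):
    """Return the most frequent sense and the accuracy of always guessing it."""
    sense, count = Counter(sense_labels).most_common(1)[0]
    return sense, count / len(sense_labels)


skewed = ["s1"] * 9 + ["s2"] * 1     # senses in a 9 to 1 ratio
balanced = ["s1"] * 5 + ["s2"] * 5   # two equiprobable senses

skewed_sense, skewed_accuracy = most_frequent_sense_baseline(skewed)
_, balanced_accuracy = most_frequent_sense_baseline(balanced)
print(skewed_accuracy, balanced_accuracy)
```

A system scoring 90% beats the baseline by 40 points in the balanced case and by nothing in the skewed one, which is why results should be reported next to the baseline.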

Page 61: COMP791A: Statistical Language Processing

61

SENSEVAL (www.senseval.org)

Standard WSD “competition” like MUC, TREC & DUC

Goals:
  Provide a common framework to compare WSD systems
  Standardise the task (especially evaluation procedures)
  Build and distribute new lexical resources

Senseval-1 (1998)
  English, French and Italian
  HECTOR senses (Oxford University Press)
Senseval-2 (2001)
  13 languages, including Chinese
  WordNet senses
Senseval-3 (March 2004)
  7 languages (but various tasks)
  WordNet senses

Page 62: COMP791A: Statistical Language Processing

62

Training text for "arm" (SENSEVAL-1)

<instance id="arm.n.om.053"> <answer instance="arm.n.om.053" senseid="arm%1:08:00::"/>

<context>

Many <p="JJ"/> terrestrial <p="JJ"/> vertebrate <p="JJ"/> animals <p="NNS"/> have <p="VBP"/> four <p="CD"/> <ne="_NUM"/> limbs <p="NNS"/> . <p="."/> Those <p="DT"/> attached <p="VBN"/> to <p="TO"/> the <p="DT"/> thoracic <p="JJ"/> portion <p="NN"/> of <p="IN"/> the <p="DT"/> body <p="NN"/> are <p="VBP"/> called <p="VBN"/> " <p="""/> <head> arms <p="NNS"/> </head> . <p="."/> " <p="""/>

</context> </instance>

<instance id="arm.n.om.045"> <answer instance="arm.n.om.045" senseid="arm%1:06:02::"/>

<context> You <p="PRP"/> are <p="VBP"/> likely <p="JJ"/> to <p="TO"/> find <p="VB"/> a <p="DT"/> rocking_chair <p="NN"/> with <p="IN"/> <head> arms <p="NNS"/> </head> in <p="IN"/> a <p="DT"/> museum <p="NN"/>

</context> </instance>

<instance id="arm.n.la.029"> <answer instance="arm.n.la.029" senseid="arm%1:06:01::"/>

<context>

" <p="""/> Unlike <p="IN"/> Linder <p="NNP"/> , <p=","/> who <p="WP"/> was <p="VBD"/> reportedly <p="RB"/> carrying <p="VBG"/> a <p="DT"/> Kalashnikov <p="NNP"/> assault_rifle <p="NN"/> for <p="IN"/> protection <p="NN"/> , <p=","/> APSNICA <p="NNP"/> volunteers <p="NNS"/> do <p="VBP"/> not <p="RB"/> bear <p="VB"/> <head> arms <p="NNS"/> </head> . <p="."/>

</context> </instance>
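To turn instances like these into training pairs, the instance id, the sense id and the head word have to be pulled out of the markup. The inline tags such as `<p="NNS"/>` have no element name, so a strict XML parser would reject this text; the sketch below uses a regular expression instead. The function name `read_instances` is illustrative, and the sample is the second instance copied from above.

```python
import re

SAMPLE = '''<instance id="arm.n.om.045"> <answer instance="arm.n.om.045" senseid="arm%1:06:02::"/>
<context> You <p="PRP"/> are <p="VBP"/> likely <p="JJ"/> to <p="TO"/> find <p="VB"/> a <p="DT"/> rocking_chair <p="NN"/> with <p="IN"/> <head> arms <p="NNS"/> </head> in <p="IN"/> a <p="DT"/> museum <p="NN"/>
</context> </instance>'''

# One match per instance: the <answer .../> tag, then the first word after <head>.
INSTANCE_RE = re.compile(
    r'<answer instance="(?P<id>[^"]+)" senseid="(?P<sense>[^"]+)"/>'
    r'.*?<head>\s*(?P<head>\S+)',
    re.DOTALL,
)


def read_instances(text):
    """Return (instance id, sense id, head word) triples found in the text."""
    return [(m.group("id"), m.group("sense"), m.group("head"))
            for m in INSTANCE_RE.finditer(text)]


instances = read_instances(SAMPLE)
print(instances)
```

The context words and their POS tags could be extracted in the same way to build the feature vectors described earlier.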

Page 63: COMP791A: Statistical Language Processing

63

What is a word sense anyways?

“A mental representation of the different meanings of a word”

Experiments in psycho-linguistics
  Ask subjects to classify index cards with sentences containing an ambiguous word into different piles
    But inter-subject agreement is low…
  Rely on introspection
    But introspection tends to rationalize often non-rational decisions
  Ask subjects to classify ambiguous words according to dictionary definitions
    Some results show high inter-subject agreement, some results show low agreement!!!