
1

COMP791A: Statistical Language Processing

Word Sense Disambiguation (Chap. 7)

2

Overview of the problem
Many words have several meanings or senses (homonyms or polysemous words)
  Ex: “chair” --> furniture or person
  Ex: “dishes” --> plates or food
Need to determine which sense of a word is used in a specific sentence
Note: often, the different senses of a word are closely related
  Ex: “title” --> right of legal ownership, document that is evidence of the legal ownership, name of a work, …
Often, several senses can be “activated” in a single context (co-activation)
  Ex: “This could bring competition to the trade”
      competition --> the act of competing AND the people who are competing

3

Word Sense Disambiguation (WSD)
To determine which of the senses of an ambiguous word is invoked in a particular use of the word.
Potentially extremely useful problem
  Ex: in machine translation…
    “chair” --> (person) “directeur”
    “chair” --> (furniture) “chaise”
    “bureau” --> “desk”
    “bureau” --> “office”
Can be done:
  with rule-based methods
  with statistical methods

4

WordNet
most widely-used lexical database for English
free! G. Miller at Princeton
www.cogsci.princeton.edu/~wn
used in many applications of NLP
EuroWordNet: Dutch, Italian, Spanish, German, French, Czech and Estonian
includes entries for open-class words only (nouns, verbs, adjectives & adverbs)

5

WordNet
Entries in WordNet 1.6 (now 2.0): 118,000 different word forms
organized according to their meanings (senses)
each entry has:
  a dictionary-style definition (gloss) of each sense
  AND a set of domain-independent lexical relations among WordNet’s entries (words), senses and synsets
senses are grouped into synsets (i.e. sets of synonyms)

6

Example 1: WordNet entry for verb serve
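(The entry itself appears as a screenshot on the original slide. As an illustration only, assuming NLTK and its WordNet data are installed, the same entry can be listed programmatically; this sketch is not part of the lecture material.)

# Sketch: list WordNet senses of the verb "serve" with NLTK
# (assumes `pip install nltk` and `nltk.download('wordnet')` have been run).
from nltk.corpus import wordnet as wn

for i, synset in enumerate(wn.synsets('serve', pos=wn.VERB), start=1):
    lemmas = ", ".join(l.name() for l in synset.lemmas())
    print(f"sense {i}: [{lemmas}] -- {synset.definition()}")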

7

Rule-based WSD
  They served green-lipped mussels from New Zealand.
  Which airlines serve Denver?
semantic restrictions on the predicate of an argument:
  argument mussels --> needs a predicate with the sense {provide-food} --> sense 6 of WordNet
  argument Denver --> needs a predicate with the sense {attend-to} --> sense 10 of WordNet

8

Example 2: WordNet entry for dish

9

Rule-based WSD
  In our house, everybody has a career and none of them includes washing dishes.
  In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig’s ears and chicken livers with green peppers.
semantic restrictions on the argument of a predicate:
  predicate wash --> needs an argument with the sense {object} --> senses 1, 2 or 6 of WordNet
  predicate stir-fry --> needs an argument with the sense {food} --> sense 2 of WordNet
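A rough sketch of how such a selectional restriction could be checked against WordNet's hypernym hierarchy (assumes NLTK's WordNet interface; the satisfies() helper and the choice of food.n.01 as the {food} restriction are illustrative assumptions, not part of the slides):

# Sketch: does a candidate sense of "dish" satisfy the {food} restriction of "stir-fry"?
from nltk.corpus import wordnet as wn

def satisfies(sense, restriction):
    """True if `restriction` (a synset) appears on some hypernym path of `sense`."""
    return any(restriction in path for path in sense.hypernym_paths())

food = wn.synset('food.n.01')            # stand-in for the {food} restriction
for sense in wn.synsets('dish', pos=wn.NOUN):
    print(sense.name(), satisfies(sense, food))
# Senses whose hypernym chain reaches food.n.01 (e.g. "a particular item of
# prepared food") remain candidates for the argument of "stir-fry".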

10

Problem with rule-based WSD
In some cases, the constraints on the predicate and on the argument are not enough to pinpoint one unique sense
  ex: “What kind of dishes do you recommend?”
Figures of speech
  the meaning of words can be generated dynamically, instead of being fixed and stored in a lexicon or set of selectional restrictions
  Ex: metaphor, metonymy

11

Problem with rule-based WSD (con’t)
Metaphor: using words / phrases whose meanings are appropriate to different kinds of concepts, suggesting a likeness or analogy between them
  This deal does not scare Microsoft.
    scare has 2 senses in WordNet: to cause fear / to cause to lose courage
    metaphor: the corporation is viewed as a person
  She is drowning in money.
    metaphor: money is viewed as a liquid

12

Problem with rule-based WSD (con’t)
Metonymy: referring to a concept by naming some other concept closely related to it
  We await word from the crown.
    a monarch is not the same thing as a crown, but we often refer to the monarch as "the crown" because the two are associated
    metonymy: the crown refers to the monarch
  The White House had no comment.
    metonymy: The White House refers to the administration

13

WSD versus POS tagging
“butter” can be a verb or noun
  “I should butter my toasts.”
  “I like butter on my toasts.”
2 different POS --> 2 different usages with 2 different meanings
So WSD can be viewed as POS tagging (classifying using semantic tags rather than POS tags)
But the 2 tasks are considered different… because:
  nearby structural cues (ex: is the previous word a determiner?) are important in POS tagging, but are not effective for WSD
  distant content words are very effective for WSD, but are not useful for POS tagging
So: in POS tagging, we typically only look at the local context; in WSD, we use content words in a larger context

14

Approaches to Statistical WSD
Supervised Disambiguation
  based on a labeled training set
  the learning system has: a training set of feature-encoded inputs AND their appropriate sense label (category)
Based on Lexical Resources
  use of external lexical resources such as dictionaries and thesauri
Discourse properties
Unsupervised Disambiguation
  based on unlabeled corpora
  the learning system has: a training set of feature-encoded inputs BUT NOT their appropriate sense label (category)

15

Approaches to Statistical WSD
--> Supervised Disambiguation
      Naïve Bayes
      Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

16

Supervised WSD: Overview
A word is assumed to have a finite number of discrete senses.
The sense of a word depends on the sense of surrounding words
  ex: bass = fish, musical instrument, …

  Surrounding words     Most probable sense
  …river…               fish
  …violin…              instrument
  …salmon…              fish
  play/Verb + bass      instrument
  bass + player         instrument
  …striped…             fish

17

Supervised WSD: Overview (con’t)
WSD is viewed as a typical classification problem
  use machine learning techniques to train a system that learns a classifier (a function f) to assign to unseen examples one of a fixed number of senses (categories)
  f(input) = correct sense
Input:
  Target word: the word to be disambiguated
  Context (feature vector): a vector of relevant linguistic features that represents its context (ex: a window of words around the target word)

18

Examples of Feature Vectors
Take a window of n words around the target word
Encode information about the words around the target word
  typical features include: words, root forms, POS tags, frequency, …

  An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.

with position information:
  [ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]
no position information, but word frequency:
  vocabulary: [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
  vector:     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
other features:
  [followed by "player", contains "show" in the sentence, …]
  [yes, no, …]
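A minimal sketch of how the frequency-based vector above could be computed (the vocabulary is the slide's; the encode() function itself is illustrative, not from the lecture):

# Sketch: encode the context of an ambiguous word as a word-frequency vector.
vocabulary = ["fishing", "big", "sound", "player", "fly", "rod",
              "pound", "double", "runs", "playing", "guitar", "band"]

def encode(context_words, vocab=vocabulary):
    """Return a vector of counts, one position per vocabulary word."""
    return [context_words.count(w) for w in vocab]

sentence = ("An electric guitar and bass player stand off to one side , "
            "not really part of the scene").lower().split()
print(encode(sentence))   # -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]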

19

Supervised WSD
Training corpus: each occurrence of the ambiguous word w is annotated with a semantic label (its contextually appropriate sense sk).
Several approaches from ML:
  Bayesian classification
  Decision trees
  Neural networks
  K-nearest neighbor (kNN)
  …

20

Approaches to Statistical WSD
--> Supervised Disambiguation
  --> Naïve Bayes
      Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

21

Naïve Bayes Classification
Goal: choose the most probable sense s* for a word given a vector V of surrounding words
the vector contains word frequencies:
  vocabulary: [fishing, big, sound, player, fly, rod, …]
  vector:     [0, 0, 0, 2, 1, 0, …]
Bayes decision rule:
  s* = argmax_{sk} P(sk|V)
where:
  S is the set of possible senses for the target word
  sk is a sense in S
  V is the feature vector (the representation of the context)
Using Bayes rule:
  s* = argmax_{sk} P(V|sk) P(sk) / P(V)

22

Decision Rule for Naive Bayes
Goal:
  s* = argmax_{sk} P(sk|V) = argmax_{sk} P(V|sk) P(sk) / P(V)        // Bayes rule
But:
  P(V|sk) = ∏_{j=1..n} P(vj|sk)        // independence assumption: the features (context words) are conditionally independent
And: P(V) is the same for all possible senses, so it does not affect the final ranking of the senses, so we can drop it.
So:
  s* = argmax_{sk} P(sk) ∏_{j=1..n} P(vj|sk)
To make the computations simpler, we often take the log of probabilities:
  s* = argmax_{sk} [ log P(sk) + Σ_{j=1..n} log P(vj|sk) ]

23

Naïve Bayes WSD
Training a Naïve Bayes classifier
  = estimating P(vj|sk) and P(sk) from a sense-tagged training corpus
  = finding the Maximum-Likelihood estimates, perhaps with appropriate smoothing

  s* = argmax_{sk} [ log P(sk) + Σ_{j=1..n} log P(vj|sk) ]

  P(vj|sk) = count(vj, sk) / Σ_t count(vt, sk)
    (nb of occurrences of feature j over the total nb of features appearing in windows of sk)

  P(sk) = count(sk) / count(word)
    (nb of occurrences of sense k over the nb of all occurrences of the ambiguous word)

24

Naïve Bayes Algorithm
// 1. training
for all senses sk of word w:
    for all words vj in the vocabulary:
        compute P(vj|sk) = count(vj, sk) / Σ_t count(vt, sk)
for all senses sk of word w:
    compute P(sk) = count(sk) / count(word)

// 2. disambiguation
for all senses sk of word w:
    score(sk) = log P(sk)
    for all words vj in the context window:
        score(sk) = score(sk) + log P(vj|sk)
choose s* = the sense sk with the greatest score(sk)

25

Example
Training corpus (context window = 3 words):
  …Today the World Bank/BANK1 and partners are calling for greater relief…
  …Welcome to the Bank/BANK1 of America the nation's leading financial institution…
  …Welcome to America's Job Bank/BANK1 Visit our site and…
  …Web site of the European Central Bank/BANK1 located in Frankfurt…
  …The Asian Development Bank/BANK1 ADB a multilateral development finance…
  …lounging against verdant banks/BANK2 carving out the...
  …for swimming, had warned her off the banks/BANK2 of the Potomac. Nobody...

Training:
  P(the|BANK1) = 5/30        P(the|BANK2) = 3/12
  P(world|BANK1) = 1/30      P(world|BANK2) = 0/12
  P(and|BANK1) = 1/30        P(and|BANK2) = 0/12
  …
  P(off|BANK1) = 0/30        P(off|BANK2) = 1/12
  P(Potomac|BANK1) = 0/30    P(Potomac|BANK2) = 1/12

  P(BANK1) = 5/7             P(BANK2) = 2/7

Disambiguation: “I lost my left shoe on the banks of the river Nile.”
  Score(BANK1) = log(5/7) + log(P(shoe|BANK1)) + log(P(on|BANK1)) + log(P(the|BANK1)) + …
  Score(BANK2) = log(2/7) + log(P(shoe|BANK2)) + log(P(on|BANK2)) + log(P(the|BANK2)) + …
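A compact, illustrative sketch of these training and disambiguation steps (not the lecture's code): the tiny hard-coded corpus only hints at the slide's examples, and add-alpha smoothing is added so unseen context words do not yield log(0):

import math
from collections import Counter, defaultdict

# Sketch of Naive Bayes WSD: each training item is (sense, context words).
training = [
    ("BANK1", "today the world and partners are".split()),
    ("BANK1", "welcome to the of america the".split()),
    ("BANK2", "lounging against verdant carving out the".split()),
    ("BANK2", "warned her off the of potomac".split()),
]

sense_count = Counter(s for s, _ in training)
word_count = defaultdict(Counter)           # word_count[sense][word]
for sense, context in training:
    word_count[sense].update(context)

def log_p_word(word, sense, alpha=1.0, vocab_size=1000):
    """log P(word|sense), add-alpha smoothed so unseen words stay finite."""
    c = word_count[sense]
    return math.log((c[word] + alpha) / (sum(c.values()) + alpha * vocab_size))

def disambiguate(context):
    scores = {}
    for sense in sense_count:
        score = math.log(sense_count[sense] / sum(sense_count.values()))
        score += sum(log_p_word(w, sense) for w in context)
        scores[sense] = score
    return max(scores, key=scores.get)

print(disambiguate("i lost my left shoe on the banks of the river nile".split()))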

26

Naïve Bayes Assumption
Independence assumption: the features (contextual words) are conditionally independent:
  P(V|sk) = ∏_{j=1..n} P(vj|sk)
  i.e. the probability of an entire feature vector given a sense is the product of the probabilities of its individual features given that sense
Consequences:
  Bag of words model: the structure and linear ordering of words within the context is ignored. The presence of one word in the bag is independent of another.
The independence assumption is incorrect but is useful in WSD
  (Gale, Church & Yarowsky, 1992) report 90% correct disambiguation with 6 ambiguous nouns in the Hansard

27

Approaches to Statistical WSD
--> Supervised Disambiguation
      Naïve Bayes
  --> Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

28

Decision Tree Classifier

Bayes Classifier uses information from all words in the context window

But some words are more reliable than others to indicate which sense is used…

29

Decision Tree Classifier (con’t)
Look for features that are very good indicators of the result
Place these features (as questions) in the nodes of a decision tree
Split the examples so that those with different values for the chosen feature are in different sets
Repeat the same process with another feature

A sequence of tests is applied to each feature vector:
  if a test succeeds --> return the sense associated with the test
  otherwise --> apply the next test
  if all features have been tested --> return a default sense (the most common one)

30

Example: bass

  Observation  Includes “fish”?  “striped bass”?  Includes “guitar”?  “bass player”?  Includes “piano”?  Sense
  1            Yes               Yes              No                  No              No                 fish
  2            Yes               Yes              No                  No              No                 fish
  3            No                No               Yes                 No              No                 instrument
  4            No                Yes              No                  No              No                 fish
  5            Yes               Yes              No                  No              No                 fish
  6            No                No               Yes                 Yes             Yes                instrument
  7            No                Yes              No                  No              No                 fish

Decision tree:
  is "fish" in the feature vector?
    yes --> fish
    no  --> is "striped" the previous word?
              yes --> fish
              no  --> is "guitar" in the feature vector?
                        yes --> instrument
                        no  --> fish

31

Another Example: The restaurant
Training data: a table of input attributes --> output decision [table shown on slide]

32

A first decision tree

But is it the best decision tree we can build?

33

A better decision tree

4 tests instead of 9 & 11 branches instead of 21

34

Choosing the best feature
The key problem is choosing which feature to split a given set of examples on
Most-used strategy: information theory

Entropy (or self-information):
  H(X) = - Σ_{x∈X} p(x) log2 p(x)

  Ex: H(fair coin) = H(1/2, 1/2) = - (1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit

35

Choosing the best feature (con't)
The "discriminating power" of an attribute A given a set S:
  gain(S, A) = entropy(S) - entropy(S|A)

If the training set contains p positive examples and n negative examples:
  entropy(S) = I( p/(p+n), n/(p+n) )
             = - p/(p+n) log2( p/(p+n) ) - n/(p+n) log2( n/(p+n) )

  entropy(S|A) = Σ_{i=1..v} (pi+ni)/(p+n) × I( pi/(pi+ni), ni/(pi+ni) )

where:
  v is the number of distinct values that attribute A can take
  I(a, b) is the entropy for an attribute with prob. a of success and prob. b of failure

36

Some intuition

Size Color Shape Output

Big Red Circle +

Small Red Circle +

Small Red Square -

Big Blue Circle -

Size is the least discriminating attribute (i.e. smallest information gain)

Shape and color are the most discriminating attributes (i.e. highest information gain)

37

A small example
So first separate according to either Color or Shape (root of the tree)
Note: by definition, 0 log 0 = 0

  Size:  big: 1+ 1-     small: 1+ 1-
  Color: red: 2+ 1-     blue: 0+ 1-
  Shape: circle: 2+ 1-  square: 0+ 1-

  Size   Color  Shape   Output
  Big    Red    Circle  +
  Small  Red    Circle  +
  Small  Red    Square  -
  Big    Blue   Circle  -

entropy(Output) = - (1/2) log2(1/2) - (1/2) log2(1/2) = 1

entropy(Output|Size)  = (1/2)(1) + (1/2)(1) = 1
gain(Output, Size)    = 1 - 1 = 0

entropy(Output|red)   = - (2/3) log2(2/3) - (1/3) log2(1/3) = 0.918
entropy(Output|blue)  = - (1/1) log2(1/1) = 0
entropy(Output|Color) = (3/4)(0.918) + (1/4)(0) = 0.6885
gain(Output, Color)   = 1 - 0.6885 = 0.3115

entropy(Output|Shape) = (3/4)(0.918) + (1/4)(0) = 0.6885
gain(Output, Shape)   = 1 - 0.6885 = 0.3115
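The same numbers can be reproduced with a short script (a sketch; the entropy/gain helpers below are illustrative, and the slide's 0.3115 becomes ≈0.311 without intermediate rounding):

import math

def entropy(labels):
    """H = - sum p log2 p over the label distribution."""
    total = len(labels)
    return -sum((labels.count(v) / total) * math.log2(labels.count(v) / total)
                for v in set(labels))

def gain(examples, attribute, target="Output"):
    """Information gain of splitting `examples` on `attribute`."""
    base = entropy([e[target] for e in examples])
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

data = [
    {"Size": "Big",   "Color": "Red",  "Shape": "Circle", "Output": "+"},
    {"Size": "Small", "Color": "Red",  "Shape": "Circle", "Output": "+"},
    {"Size": "Small", "Color": "Red",  "Shape": "Square", "Output": "-"},
    {"Size": "Big",   "Color": "Blue", "Shape": "Circle", "Output": "-"},
]
for attr in ("Size", "Color", "Shape"):
    print(attr, round(gain(data, attr), 4))
# -> Size 0.0, Color ≈0.311, Shape ≈0.311 (matches the slide up to rounding)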

38

The restaurant example
With the data on p. 27, we have:

gain(Output, Patrons) = 1 - [ (2/12) I(0/2, 2/2) + (4/12) I(4/4, 0/4) + (6/12) I(2/6, 4/6) ]
                      = 1 - [ (2/12)(- 0/2 log2 0/2 - 2/2 log2 2/2) + (4/12)(- 4/4 log2 4/4 - 0/4 log2 0/4) + … ]
                      ≈ 0.541 bits

gain(Output, Type)    = 1 - [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ]
                      = 0 bits

So the root of the tree should be the attribute Patrons (we gain more information)
Then do the same recursively for the subtrees

39

Back to WSD
Translating the French word “prendre” can be seen as WSD
possible translations/senses = {take, make, rise, speak}

  Observation  Tense  Word left  Direct object  Word right  Sense
  1            …      …          mesure         …           take
  2            …      …          note           …           take
  3            …      …          exemple        …           take
  4            …      …          décision       …           make
  5            …      …          parole         …           speak
  6            …      …          parole         …           rise

40

Back to WSD (con't)
(Brown et al., 1991) found, on the Canadian Hansard:

  Ambiguous word  Possible senses / translations       Best feature      Example
  “Prendre”       {“take”, “make”, “rise”, “speak”}    Direct object     “prendre une mesure” --> “to take”;  “prendre une décision” --> “to make”
  “Vouloir”       {“to want”, “to like”}               Tense             present --> “to want”;  conditional --> “to like”
  “Cent”          {“%”, “¢”}                           Word to the left  “pour” --> “%”;  number --> “¢”

41

Training Set
With supervised methods, we need a large sense-tagged training set… where do you get it from?
Using a "real" training set
  Main standard hand-sense-tagged corpora:
    SEMCOR corpus: portion of the Brown corpus tagged with WordNet senses
    SENSEVAL corpus (www.senseval.org/): standard WSD “competition” like MUC, TREC & DUC
    Open Mind Word Expert (OMWE)
Using pseudowords:
  Artificial ambiguous words created by conflating two or more words.
  Ex: occurrences of “banana” and “door” can be replaced by “banana-door”
  The disambiguation algorithm can now be tested on this data to disambiguate the pseudoword “banana-door” into either “banana” or “door”
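Building pseudoword data is mechanical; a small illustrative sketch (the banana/door pair is the slide's example, the function itself is an assumption):

# Sketch: build a pseudoword corpus by conflating "banana" and "door".
# Each occurrence is replaced by "banana-door"; the original word is kept
# as the gold label, so the disambiguator can be scored automatically.
def make_pseudoword_corpus(sentences, words=("banana", "door"), merged="banana-door"):
    corpus = []
    for sent in sentences:
        tokens, labels = [], []
        for tok in sent.split():
            if tok.lower() in words:
                labels.append(tok.lower())   # gold sense = the original word
                tokens.append(merged)
            else:
                tokens.append(tok)
        corpus.append((" ".join(tokens), labels))
    return corpus

print(make_pseudoword_corpus(["She ate a banana", "Close the door please"]))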

42

Problems…
With supervised (or unsupervised) methods:
  a large amount of work is needed to create a classifier for each ambiguous word!
  So most work based on these techniques reports results on only a few words (2 to 12 words)
  Scaling up these approaches to deal with all ambiguous words is an immense amount of work!
Solutions:
  use lexical resources (ex: machine-readable dictionaries)
  use distributional properties to improve disambiguation: ambiguous words are only used in one sense in any given discourse and with any given collocate.

43

Approaches to Statistical WSD
    Supervised Disambiguation
      Naïve Bayes
      Decision Trees
--> Use of Lexical Resources
  --> Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

44

WSD based on sense definitions
(Lesk, 1986) A word’s dictionary definitions are likely to be good indicators for the senses they define.
Method:
  Express each dictionary definition of the ambiguous word as a bag of words
  Express the context of the ambiguous word as a single bag of words built from the dictionary definitions of the context words
  Choose the definition of the ambiguous word that has the greatest overlap with the words occurring in its context

45

Example "Cone" in dictionary:

DEF-1: “solid body which narrows to a point” BAG = {body, narrows, point, solid}

DEF-2: “something of this shape whether solid or hollow” BAG = {hollow, shape, something, solid}

DEF-3: “fruit of certain evergreen tree” BAG = {evergreen, fruit, tree}

To disambiguate "cone" in "pine cone" "Pine" in dictionary

DEF-1: “kind of evergreen tree” DEF-2: “waste away through sorrow or illness” --> BAG = {evergreen, illness, kind, sorrow, tree, waste}

so "cone" is: score(DEF-1) = {body, narrows, point, solid} {evergreen, illness, kind, sorrow, tree, waste}

= 0 score(DEF-2) = {hollow,shape,something,solid} {evergreen, illness, kind, sorrow, tree,

waste} = 0

score(DEF-3) = {evergreen, fruit, tree} {evergreen, illness, kind, sorrow, tree, waste} = 2

Max overlap: DEF-3

46

The algorithm
for all senses sk of word w:
    score(sk) = overlap(
        the words in the dictionary definition of sense sk,
        the union of the words in all context windows that also appear in a definition of w
    )
pick the sense s* with the highest score(sk)
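A minimal sketch of this overlap computation, hard-coding the cone/pine bags from the previous slide (illustrative only, not a full Lesk implementation):

# Sketch: simplified Lesk -- pick the sense whose definition overlaps most
# with the bag of words built from the context words' definitions.
cone_defs = {
    "DEF-1": {"body", "narrows", "point", "solid"},
    "DEF-2": {"hollow", "shape", "something", "solid"},
    "DEF-3": {"evergreen", "fruit", "tree"},
}
# bag of words from the definitions of the context word "pine"
context_bag = {"evergreen", "illness", "kind", "sorrow", "tree", "waste"}

scores = {sense: len(bag & context_bag) for sense, bag in cone_defs.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)     # {'DEF-1': 0, 'DEF-2': 0, 'DEF-3': 2} -> DEF-3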

47

Analysis
Accuracies of 50-70% on short samples of texts
Problem:
  dictionary entries for the target words are usually relatively short and may not provide sufficient material to create adequate classifiers, because the words in the context and their definitions must have direct overlap
One solution: expand the list of words whose definitions make use of the target word
  Example: if “deposit” does not occur in the definition of “bank”, but “bank” occurs in the definition of “deposit”, we can expand the classifier for “bank” to include “deposit” as a relevant feature
However:
  just knowing that “deposit” is related to “bank” does not help much if we do not know to which sense of “bank” it is related
  --> to make use of “deposit” as a feature, we have to know which sense of “bank” was being used in the definition
Solution: use a thesaurus…

48

Approaches to Statistical WSD
    Supervised Disambiguation
      Naïve Bayes
      Decision Trees
--> Use of Lexical Resources
      Dictionary-based
  --> Thesaurus-based
      Translation-based
    Discourse properties
    Unsupervised Disambiguation

49

Thesaurus-Based Disambiguation
Thesauri include tags (subject codes) in their entries that correspond to broad semantic categories
Each word is assigned one or more subject codes, which correspond to its different meanings
  ex: ANIMAL/INSECT (category 414), TOOLS/MACHINERY (category 348)
The semantic categories of the words in a context determine the semantic category of the whole context
This category determines which word senses are used
Method:
  for each subject code, count the number of words in the context that have the same subject code
  select the subject code that has the highest count
Accuracy ~50% (but with difficult and highly ambiguous words)
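The counting step is a majority vote over subject codes; a sketch (the toy subject-code lexicon below is invented for illustration, it is not a real Roget index):

from collections import Counter

# Sketch: thesaurus-based WSD as a vote over subject codes.
# `subject_codes` maps each word to the codes of its thesaurus entries.
subject_codes = {
    "fly":    {"ANIMAL/INSECT", "TOOLS/MACHINERY"},
    "larva":  {"ANIMAL/INSECT"},
    "wing":   {"ANIMAL/INSECT", "TOOLS/MACHINERY"},
    "engine": {"TOOLS/MACHINERY"},
}

def dominant_code(context_words):
    votes = Counter()
    for w in context_words:
        votes.update(subject_codes.get(w, ()))
    return votes.most_common(1)[0][0] if votes else None

# The category of the context selects the sense of the ambiguous word.
print(dominant_code(["larva", "wing", "fly"]))      # -> ANIMAL/INSECT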

50

Some Results
(Yarowsky, 1992), using Roget categories:

  Word      Sense               Roget category   Accuracy
  bass      musical instrument  MUSIC            99%
  bass      fish                ANIMAL, INSECT   100%
  star      space object        UNIVERSE         96%
  star      celebrity           ENTERTAINER      95%
  star      star-shaped object  INSIGNIA         82%
  interest  curiosity           REASONING        88%
  interest  advantage           INJUSTICE        34%
  interest  financial           DEBT             90%
  interest  share               PROPERTY         38%

51

Approaches to Statistical WSD
    Supervised Disambiguation
      Naïve Bayes
      Decision Trees
--> Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
  --> Translation-based
    Discourse properties
    Unsupervised Disambiguation

52

Translation-Based WSD
Words can be disambiguated by looking at how they are translated in other languages
Example: the word “interest”

                      sense1                       sense2
  Definition          legal share                  attention, concern
  German translation  “Beteiligung”                “Interesse”
  English phrase      “acquire an interest”        “show interest”
  Translation         “erwerb eine Beteiligung”    “Interesse zeigen”

To disambiguate the word “interest” in “showed interest”:
  the German translation of “show” is “zeigen”
  in a German corpus, we always find “zeigen interesse” and never find “zeigen beteiligung”
  so in the original phrase “showed interest”, interest had sense2

To disambiguate the word “interest” in “acquired an interest”:
  the German translation of “acquired” is “erwarb”
  in a German corpus: C(“erwarb”, “beteiligung”) > C(“erwarb”, “interesse”)
  so in the original phrase “acquired an interest”, interest had sense1
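The decision reduces to comparing co-occurrence counts in the second-language corpus; a sketch (the three-sentence German corpus and the windowed count function are stand-ins; only the translation pairs come from the slide):

# Sketch: translation-based WSD for "interest".
# Count how often each candidate German translation co-occurs with the
# translation of the governing word in a German corpus (toy corpus here).
german_corpus = [
    "er erwarb eine beteiligung an der firma",
    "sie zeigte grosses interesse an dem projekt",
    "die bank erwarb eine weitere beteiligung",
]

def cooccurrence(word_a, word_b, corpus, window=4):
    count = 0
    for sent in corpus:
        toks = sent.split()
        for i, t in enumerate(toks):
            if t == word_a and word_b in toks[max(0, i - window): i + window + 1]:
                count += 1
    return count

# "acquired an interest": compare C(erwarb, beteiligung) vs C(erwarb, interesse)
c1 = cooccurrence("erwarb", "beteiligung", german_corpus)
c2 = cooccurrence("erwarb", "interesse", german_corpus)
print("sense1 (Beteiligung)" if c1 > c2 else "sense2 (Interesse)")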

53

Approaches to Statistical WSD
    Supervised Disambiguation
      Naïve Bayes
      Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
--> Discourse properties
    Unsupervised Disambiguation

54

Discourse Properties (Yarowsky, 1995)
So far, all methods have considered each occurrence of the ambiguous word separately…
But…
One sense per discourse:
  one document --> one sense
  i.e. assign the majority sense of the discourse to all occurrences of the target word
One sense per collocation:
  select some nearby word that gives very good clues, i.e. the words of a collocation <-> the sense of the target word
We can combine these 2 heuristics
(Yarowsky, 1995) shows a reduction of the error rate by 27% when using the discourse constraint!
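The one-sense-per-discourse constraint can be applied as a simple post-processing step over the output of any base classifier; an illustrative sketch (not Yarowsky's actual procedure):

from collections import Counter

# Sketch: enforce "one sense per discourse" by re-labelling every occurrence
# of the target word in a document with the document's majority sense.
def one_sense_per_discourse(per_occurrence_senses):
    """per_occurrence_senses: senses predicted by a base classifier for each
    occurrence of the target word within one document."""
    if not per_occurrence_senses:
        return []
    majority = Counter(per_occurrence_senses).most_common(1)[0][0]
    return [majority] * len(per_occurrence_senses)

print(one_sense_per_discourse(["BANK1", "BANK2", "BANK1", "BANK1"]))
# -> ['BANK1', 'BANK1', 'BANK1', 'BANK1']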

55

Approaches to Statistical WSD
    Supervised Disambiguation
      Naïve Bayes
      Decision Trees
    Use of Lexical Resources
      Dictionary-based
      Thesaurus-based
      Translation-based
    Discourse properties
--> Unsupervised Disambiguation

56

Unsupervised Disambiguation
Disambiguate word senses:
  without supporting tools such as dictionaries and thesauri
  without a labeled training text
Without such resources, we cannot really identify/label the senses
  i.e. we cannot say bank-1 or bank-2; we do not even know the different senses of a word!
But we can:
  cluster/group the contexts of an ambiguous word into a number of groups
  discriminate between these groups without actually labeling them

57

Clustering
Represent each instance of the ambiguous word as a vector <f1, f2, f3, …, fV>
  V is the vocabulary size
  fi is the frequency of word i in the context
Each vector can be visually represented as a point in a V-dimensional space
  [figure: context vectors V1, V2, V3 plotted along the axes word1, word2, word3]

58

Clustering
Hypothesis: same senses of words will have similar neighboring words
Disambiguation algorithm:
  Identify the context vectors corresponding to all occurrences of a particular word
  Partition them into regions of high density
  Assign a sense tag to each such region
Disambiguating an occurrence of the word:
  Compute the context vector of the occurrence
  Find the closest centroid of a region
  Assign the occurrence the sense of that centroid
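A sketch of this procedure using k-means over toy context vectors (scikit-learn's KMeans is one possible clustering choice, not prescribed by the slides; the vocabulary and counts are invented):

# Sketch: unsupervised sense discrimination with k-means over context vectors.
# Assumes scikit-learn and numpy are available; one vector per occurrence.
import numpy as np
from sklearn.cluster import KMeans

vocab = ["river", "water", "fishing", "guitar", "music", "player"]
contexts = np.array([
    [2, 1, 1, 0, 0, 0],     # occurrences near river/water/fishing
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],     # occurrences near guitar/music/player
    [0, 0, 0, 1, 2, 1],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(contexts)

# Disambiguating a new occurrence: assign it to the closest centroid.
new_occurrence = np.array([[1, 1, 2, 0, 0, 0]])
print(kmeans.predict(new_occurrence))   # expected: same cluster as the river-like contexts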

59

Evaluating WSD
Metrics:
  Accuracy: the % of words that are tagged correctly
  Precision & Recall, computed from:
    Good: nb of correct answers provided by the system
    Bad:  nb of wrong answers provided by the system
    Null: nb of cases in which the system doesn’t provide any answer
  compared to a gold standard: SEMCOR corpus, SENSEVAL corpus, original text without pseudo-words, …
Difficulty in evaluation:
  the nature of the senses to distinguish has a huge impact on results
  coarse VS fine-grained sense distinctions
    ex: “chair” --> person VS furniture
    ex: “bank” --> financial institution VS building
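With the Good/Bad/Null counts above, precision is usually computed over the answers the system attempted and recall over all instances; a small sketch with invented counts (the formulas are the common convention for this setup, not spelled out on the slide):

# Sketch: WSD evaluation from Good / Bad / Null counts against a gold standard.
def evaluate(good, bad, null):
    attempted = good + bad
    total = good + bad + null
    precision = good / attempted if attempted else 0.0
    recall = good / total if total else 0.0
    return precision, recall

# e.g. 70 correct answers, 20 wrong, 10 left unanswered (invented counts)
p, r = evaluate(70, 20, 10)
print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.78 recall=0.70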

60

Bounds on Performance
Upper and Lower Bounds on Performance: a measure of how well an algorithm performs relative to the difficulty of the task.
Upper bound: human performance
  Around 97%-99% with few and clearly distinct senses
  Inter-judge agreement:
    with words with clear & distinct senses --> 95% and up
    with polysemous words with related senses --> 65%-70%
Lower bound (or baseline): usually the assignment of the most frequent sense
  90% is excellent for a word with 2 equiprobable senses
  90% is trivial for a word with 2 senses with a probability ratio of 9 to 1!

61

SENSEVAL (www.senseval.org)
Standard WSD “competition” like MUC, TREC & DUC
Goals:
  provide a common framework to compare WSD systems
  standardise the task (especially evaluation procedures)
  build and distribute new lexical resources
Senseval-1 (1998)
  English, French and Italian
  HECTOR senses (Oxford University Press)
Senseval-2 (2001)
  13 languages, including Chinese
  WordNet senses
Senseval-3 (March 2004)
  7 languages (but various tasks)
  WordNet senses

62

Training text for "arm" (SENSEVAL-1) <instance id="arm.n.om.053"> <answer instance="arm.n.om.053" senseid="arm%1:08:00::"/>

<context>

Many <p="JJ"/> terrestrial <p="JJ"/> vertebrate <p="JJ"/> animals <p="NNS"/> have <p="VBP"/> four <p="CD"/> <ne="_NUM"/> limbs <p="NNS"/> . <p="."/> Those <p="DT"/> attached <p="VBN"/> to <p="TO"/> the <p="DT"/> thoracic <p="JJ"/> portion <p="NN"/> of <p="IN"/> the <p="DT"/> body <p="NN"/> are <p="VBP"/> called <p="VBN"/> " <p="""/> <head> arms <p="NNS"/> </head> . <p="."/> " <p="""/>

</context> </instance>

<instance id="arm.n.om.045"> <answer instance="arm.n.om.045" senseid="arm%1:06:02::"/>

<context> You <p="PRP"/> are <p="VBP"/> likely <p="JJ"/> to <p="TO"/> find <p="VB"/> a <p="DT"/> rocking_chair <p="NN"/> with <p="IN"/> <head> arms <p="NNS"/> </head> in <p="IN"/> a <p="DT"/> museum <p="NN"/>

</context> </instance>

<instance id="arm.n.la.029"> <answer instance="arm.n.la.029" senseid="arm%1:06:01::"/>

<context>

" <p="""/> Unlike <p="IN"/> Linder <p="NNP"/> , <p=","/> who <p="WP"/> was <p="VBD"/> reportedly <p="RB"/> carrying <p="VBG"/> a <p="DT"/> Kalashnikov <p="NNP"/> assault_rifle <p="NN"/> for <p="IN"/> protection <p="NN"/> , <p=","/> APSNICA <p="NNP"/> volunteers <p="NNS"/> do <p="VBP"/> not <p="RB"/> bear <p="VB"/> <head> arms <p="NNS"/> </head> . <p="."/>

</context> </instance>

63

What is a word sense anyways?
“A mental representation of a distinct meaning of a word”
Experiments in psycho-linguistics:
  Ask subjects to classify index cards with sentences containing an ambiguous word into different piles
    but inter-subject agreement is low…
  Rely on introspection
    but introspection tends to rationalize often non-rational decisions
  Ask subjects to classify ambiguous words according to dictionary definitions
    some results show high inter-subject agreement, some results show low agreement!!!
