
Slide 1

Word sense disambiguation (2)

Instructor: Paul Tarau, based on Rada Mihalcea’s original slides

Note: Some of the material in this slide set was adapted from a tutorial given by Rada Mihalcea & Ted Pedersen at ACL 2005


Slide 2

What is Supervised Learning?

Collect a set of examples that illustrate the various possible classifications or outcomes of an event.

Identify patterns in the examples associated with each particular class of the event.

Generalize those patterns into rules.

Apply the rules to classify a new event.


Slide 3

Learn from these examples: "When do I go to the store?"

Day   CLASS: Go to Store?   F1: Hot Outside?   F2: Slept Well?   F3: Ate Well?
1     YES                   YES                NO                NO
2     NO                    YES                NO                YES
3     YES                   NO                 NO                NO
4     NO                    NO                 NO                YES
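As a preview of the supervised methodology, here is a minimal sketch (not from the slides) of learning this table with an off-the-shelf decision tree. It assumes scikit-learn is available and encodes YES as 1 and NO as 0:

```python
from sklearn.tree import DecisionTreeClassifier

# Feature columns: F1 Hot Outside?, F2 Slept Well?, F3 Ate Well?
X = [[1, 0, 0],   # Day 1
     [1, 0, 1],   # Day 2
     [0, 0, 0],   # Day 3
     [0, 0, 1]]   # Day 4
y = [1, 0, 1, 0]  # CLASS: Go to Store? (1 = YES, 0 = NO)

clf = DecisionTreeClassifier().fit(X, y)
# On this toy data the tree needs only F3: go to the store iff "Ate Well?" is NO.
print(clf.predict([[0, 1, 0]]))  # a new day: not hot, slept well, ate badly -> [1]
```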



Slide 5

Task Definition: Supervised WSD

Supervised WSD: a class of methods that induce a classifier from manually sense-tagged text using machine learning techniques.

Resources:
- Sense-tagged text
- Dictionary (implicit source of the sense inventory)
- Syntactic analysis (POS tagger, chunker, parser, …)

Scope:
- Typically one target word per context
- Part of speech of the target word is resolved
- Lends itself to the "targeted word" formulation

Reduces WSD to a classification problem, where a target word is assigned the most appropriate sense from a given set of possibilities, based on the context in which it occurs.


Slide 6

Sense Tagged Text

Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.

My bank/1 charges too much for an overdraft.

I went to the bank/1 to deposit my check and get a new ATM card.

The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.

My grandfather planted his pole in the bank/2 and got a great big catfish!

The bank/2 is pretty muddy, I can't walk there.


Slide 7

Two Bags of Words (Co-occurrences in the “window of context”)

FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were

RIVER_BANK_BAG: a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West


Slide 8

Simple Supervised Approach

Given a sentence S containing “bank”:

For each word Wi in S:
    If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1
    If Wi is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1

If Sense_1 > Sense_2 then print "Financial"
else if Sense_2 > Sense_1 then print "River"
else print "Can't Decide"
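In runnable form, the same counter might look like the following sketch; the two bags are the word sets from Slide 7, and the example sentence is illustrative:

```python
FINANCIAL_BANK_BAG = {"a", "an", "and", "are", "ATM", "Bonnie", "card",
                      "charges", "check", "Clyde", "criminals", "deposit",
                      "famous", "for", "get", "I", "much", "My", "new",
                      "overdraft", "really", "robbers", "the", "they",
                      "think", "to", "too", "two", "went", "were"}
RIVER_BANK_BAG = {"a", "an", "and", "big", "campus", "cant", "catfish",
                  "East", "got", "grandfather", "great", "has", "his", "I",
                  "in", "is", "Minnesota", "Mississippi", "muddy", "My",
                  "of", "on", "planted", "pole", "pretty", "right", "River",
                  "The", "the", "there", "University", "walk", "West"}

def disambiguate_bank(sentence):
    sense_1 = sense_2 = 0
    for w in sentence.split():
        if w in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if w in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate_bank("I went to deposit my check at the bank"))  # Financial
```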


Slide 9

Supervised Methodology

1. Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.
   - One tagged word per instance ("lexical sample" disambiguation)

2. Select a set of features with which to represent context.
   - Co-occurrences, collocations, POS tags, verb-object relations, etc.

3. Convert sense-tagged training instances to feature vectors.

4. Apply a machine learning algorithm to induce a classifier.
   - Form: structure or relation among features
   - Parameters: strength of feature interactions

5. Convert a held-out sample of test data into feature vectors.
   - "Correct" sense tags are known but not used.

6. Apply the classifier to test instances to assign a sense tag.


Slide 10

From Text to Feature Vectors

My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1)

The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2)

      P-2   P-1   P+1    P+2   fish   check   river   interest   SENSE TAG
S1    adv   det   prep   det   Y      N       Y       N          SHORE
S2    -     det   verb   det   N      Y       N       Y          FINANCE
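One way to compute such a vector, sketched in Python (the tagged-text representation and helper names here are illustrative assumptions, not the slides' code):

```python
KEYWORDS = ["fish", "check", "river", "interest"]

def to_feature_vector(tagged, target_index):
    """tagged: list of (word, POS) pairs; target_index: position of the target."""
    words = [w.lower() for w, _ in tagged]
    tags = [t for _, t in tagged]
    def pos_at(offset):
        i = target_index + offset
        return tags[i] if 0 <= i < len(tags) else "-"
    vec = {"P-2": pos_at(-2), "P-1": pos_at(-1),
           "P+1": pos_at(+1), "P+2": pos_at(+2)}
    for k in KEYWORDS:
        vec[k] = "Y" if k in words else "N"
    return vec

s1 = [("My", "pronoun"), ("grandfather", "noun"), ("used", "verb"),
      ("to", "prep"), ("fish", "verb"), ("along", "adv"), ("the", "det"),
      ("banks", "noun"), ("of", "prep"), ("the", "det"),
      ("Mississippi", "noun"), ("River", "noun")]
print(to_feature_vector(s1, 7))  # P-2=adv, P-1=det, P+1=prep, P+2=det,
                                 # fish=Y, check=N, river=Y, interest=N
```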


Slide 11

Supervised Learning Algorithms

Once data is converted to feature-vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results:
- Support Vector Machines
- Nearest Neighbor Classifiers
- Decision Trees
- Decision Lists
- Naïve Bayesian Classifiers
- Perceptrons
- Neural Networks
- Graphical Models
- Log Linear Models


Slide 12

Naïve Bayesian Classifier

The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997). Word Sense Disambiguation is no exception.

- Assumes conditional independence among features, given the sense of a word.
- The form of the model is assumed, but the parameters are estimated from training instances.
- When applied to WSD, the features are often a "bag of words" that comes from the training data.
- Usually thousands of binary features indicate whether a word is present in the context of the target word (or not).


Slide 13

Bayesian Inference

Given the observed features, what is the most likely sense?
- Estimate the probability of the observed features given the sense.
- Estimate the unconditional probability of the sense.
- The unconditional probability of the features is a normalizing term; it doesn't affect sense classification.

$$ p(S \mid F_1, F_2, F_3, \ldots, F_n) = \frac{p(F_1, F_2, F_3, \ldots, F_n \mid S)\; p(S)}{p(F_1, F_2, F_3, \ldots, F_n)} $$


Slide 14

Naïve Bayesian Model

[Figure: graphical model with the sense S as the parent of the features F1, F2, …, Fn]

$$ P(F_1, F_2, \ldots, F_n \mid S) = p(F_1 \mid S) \cdot p(F_2 \mid S) \cdot \ldots \cdot p(F_n \mid S) $$


Slide 15

The Naïve Bayesian Classifier

Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

P(S=1) = 1,500/2,000 = .75
P(S=2) = 500/2,000 = .25

Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.

P(F1="credit") = 204/2,000 = .102
P(F1="credit" | S=1) = 200/1,500 = .133
P(F1="credit" | S=2) = 4/500 = .008

Given a test instance with the single feature "credit":
P(S=1 | F1="credit") = .133 * .75 / .102 = .978
P(S=2 | F1="credit") = .008 * .25 / .102 = .020

$$ \hat{S} = \operatorname*{argmax}_{S \in \text{senses}} \; p(F_1 \mid S) \cdot \ldots \cdot p(F_n \mid S) \cdot p(S) $$
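The arithmetic above can be checked in a few lines. (A sketch; with a single feature this is plain Bayes' rule, and the naive product only matters once several features are present.)

```python
p_s1, p_s2 = 1500 / 2000, 500 / 2000   # P(S=1), P(S=2)
p_credit = 204 / 2000                  # P(F1="credit")
p_credit_s1 = 200 / 1500               # P(F1="credit" | S=1)
p_credit_s2 = 4 / 500                  # P(F1="credit" | S=2)

print(p_credit_s1 * p_s1 / p_credit)   # ~0.980 (the slide's .978 uses the rounded .133)
print(p_credit_s2 * p_s2 / p_credit)   # ~0.020
```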


Slide 16

Comparative Results

(Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of "line"…

(Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List learners, Disjunctive and Conjunctive Normal Form learners, and a Perceptron when disambiguating six senses of "line"…

(Pedersen, 1998) compared Naïve Bayes with a Decision Tree, a Rule-Based Learner, a Probabilistic Model, etc. when disambiguating "line" and 12 other words…

…All found that the Naïve Bayesian Classifier performed as well as any of the other methods!


Slide 17

Decision Lists and Trees

Very widely used in Machine Learning; decision trees were used very early in WSD research (e.g., Kelly and Stone, 1975; Black, 1988).

- Represent the disambiguation problem as a series of questions (about the presence of a feature) that reveal the sense of a word.
  - A list decides between two senses after one positive answer.
  - A tree allows a decision among multiple senses after a series of answers.
- Use a smaller, more refined set of features than "bag of words" and Naïve Bayes.
- More descriptive and easier to interpret.


Slide 18

Decision List for WSD (Yarowsky, 1994)

Identify collocational features from sense-tagged data.

Word immediately to the left or right of the target:
- I have my bank/1 statement.
- The river bank/2 is muddy.

Pair of words to the immediate left or right of the target:
- The world's richest bank/1 is here in New York.
- The river bank/2 is muddy.

Words found within k positions to the left or right of the target, where k is often 10-50:
- My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.


Slide 19

Building the Decision List

Sort the collocation tests by the (absolute) log ratio of conditional probabilities. Words most indicative of one sense (and not the other) will be ranked highly.

$$ \text{DL-score} = \mathrm{Abs}\!\left( \log \frac{p(S=1 \mid F_i = \text{Collocation}_i)}{p(S=2 \mid F_i = \text{Collocation}_i)} \right) $$


Slide 20

Computing DL score

Given 2,000 instances of "bank", 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
P(S=1) = 1,500/2,000 = .75
P(S=2) = 500/2,000 = .25

Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:
P(F1="credit") = 204/2,000 = .102
P(F1="credit" | S=1) = 200/1,500 = .133
P(F1="credit" | S=2) = 4/500 = .008

From Bayes' rule:
P(S=1 | F1="credit") = .133 * .75 / .102 = .978
P(S=2 | F1="credit") = .008 * .25 / .102 = .020

DL score = abs(log(.978 / .020)) = 3.89
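Checking this in Python (math.log is the natural logarithm, which is what yields 3.89):

```python
import math
print(abs(math.log(0.978 / 0.020)))  # ~3.89
```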


Slide 21

Using the Decision List

Sort by DL-score, then go through the test instance looking for a matching feature. The first match reveals the sense…

DL-score   Feature              Sense
3.89       credit within bank   Bank/1 financial
2.20       bank is muddy        Bank/2 river
1.09       pole within bank     Bank/2 river
0.00       of the bank          N/A
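A sketch of first-match classification with this list; the match tests are simplified substring stand-ins for the collocational features above:

```python
decision_list = [
    (3.89, lambda s: "credit" in s, "Bank/1 financial"),
    (2.20, lambda s: "bank is muddy" in s, "Bank/2 river"),
    (1.09, lambda s: "pole" in s, "Bank/2 river"),
]

def classify(sentence):
    text = sentence.lower()
    # Highest-scoring matching feature wins; no match falls through to N/A.
    for score, test, sense in sorted(decision_list, key=lambda t: t[0], reverse=True):
        if test(text):
            return sense
    return "N/A"

print(classify("My credit is horrible because my bank made mistakes"))  # Bank/1 financial
print(classify("I planted my pole in the bank"))                        # Bank/2 river
```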


Slide 22

Using the Decision List

[Figure: the decision list of Slide 21 drawn as a flowchart: CREDIT? leads to BANK/1 FINANCIAL; otherwise IS MUDDY? leads to BANK/2 RIVER; otherwise POLE? leads to BANK/2 RIVER]


Slide 23

Learning a Decision Tree

Identify the feature that most "cleanly" divides the training data into the known senses.
- "Cleanly" is measured by information gain or gain ratio (sketched below).

Create subsets of the training data according to the feature's values.

Find another feature that most cleanly divides a subset of the training data.

Continue until each subset of the training data is "pure", or as clean as possible.

Well-known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993).

In Senseval-1, a modified decision list (which supported some conditional branching) was the most accurate system for the English Lexical Sample task (Yarowsky, 2000).
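Information gain can be computed in a few lines (an illustration, not a full ID3/C4.5 implementation):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """feature_values[i] is True/False for instance i."""
    n = len(labels)
    gain = entropy(labels)
    for v in (True, False):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

# A feature that splits two senses perfectly has maximal gain:
print(information_gain(["shore", "shore", "finance", "finance"],
                       [True, True, False, False]))  # 1.0
```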


Slide 24

Supervised WSD with Individual Classifiers

Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, and most work reasonably well. (Witten and Frank, 2000) is a great introduction to supervised learning.

Features tend to differentiate among methods more than the learning algorithms do.

Good sets of features tend to include:
- Co-occurrences or keywords (global)
- Collocations (local)
- Bigrams (local and global)
- Part of speech (local)
- Predicate-argument relations (verb-object, subject-verb)
- Heads of noun and verb phrases


Slide 25

Convergence of Results

The accuracy of different systems applied to the same data tends to converge on a particular value; no one system is shockingly better than another.
- Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task.
- Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task.
- Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task.

…What to do next?


Slide 26

Ensembles of Classifiers

Classifier error has two components: bias and variance.
- Some algorithms (e.g., decision trees) try to build a representation of the training data: low bias / high variance.
- Others (e.g., Naïve Bayes) assume a parametric form and don't represent the training data: high bias / low variance.
- Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy.

"Bagging" a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996); a sketch follows this list.
- Sample with replacement from the training data to learn multiple decision trees.
- Outliers in the training data will tend to be obscured/eliminated.
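A minimal sketch of bagging as just described, assuming scikit-learn (which also provides a ready-made BaggingClassifier):

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=25):
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        trees.append(DecisionTreeClassifier().fit([X[i] for i in idx],
                                                  [y[i] for i in idx]))
    return trees

def predict(trees, x):
    votes = Counter(t.predict([x])[0] for t in trees)  # simple majority vote
    return votes.most_common(1)[0][0]
```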


Slide 27

Ensemble Considerations

Must choose learning algorithms with significantly different bias/variance characteristics.
- e.g., a Naïve Bayesian Classifier versus a Decision Tree

Must choose feature representations that yield significantly different (independent?) views of the training data.
- e.g., lexical versus syntactic features

Must choose how to combine the classifiers (probability averaging is sketched below):
- Simple majority voting
- Averaging of probabilities across multiple classifier outputs
- Maximum Entropy combination (e.g., Klein et al., 2002)
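As an illustration, probability averaging might look as follows for classifiers that expose predict_proba in the scikit-learn style (assumed to share one class ordering):

```python
def average_probabilities(classifiers, x, classes):
    # Average each class's predicted probability across the classifiers,
    # then return the class with the highest average.
    avg = [sum(c.predict_proba([x])[0][i] for c in classifiers) / len(classifiers)
           for i in range(len(classes))]
    return classes[avg.index(max(avg))]
```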


Slide 28

Ensemble Results

(Pedersen, 2000) achieved state of the art for the "interest" and "line" data using an ensemble of Naïve Bayesian Classifiers.
- Many Naïve Bayesian Classifiers trained on varying-sized windows of context / bags of words.
- Classifiers combined by a weighted vote.

(Florian and Yarowsky, 2002) achieved state of the art for the Senseval-1 and Senseval-2 data using a combination of six classifiers.
- A rich set of collocational and syntactic features.
- Combined via a linear combination of the top three classifiers.

Many Senseval-2 and Senseval-3 systems employed ensemble methods.


Slide 29

Task Definition: Minimally Supervised WSD

Supervised WSD = learning sense classifiers starting with annotated data.

Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision.

Examples:
- Automatically bootstrap a corpus starting with a few human-annotated examples.
- Use monosemous relatives / dictionary definitions to automatically construct sense-tagged data.
- Rely on Web users + active learning for corpus annotation.


Slide 30

Bootstrapping WSD Classifiers

Build sense classifiers with little training data.
Expand the applicability of supervised WSD.

Bootstrapping approaches:
- Co-training
- Self-training
- The Yarowsky algorithm


Slide 31

Bootstrapping Recipe

Ingredients:
- (Some) labeled data
- (Large amounts of) unlabeled data
- (One or more) basic classifiers

Output:
- A classifier that improves over the basic classifiers


Slide 32

[Figure: bootstrapping on the word "plant"]

Labeled seed examples:
… plants#1 and animals …
… industry plant#2 …

Unlabeled examples:
… building the only atomic plant …
… plant growth is retarded …
… a herb or flowering plant …
… a nuclear power plant …
… building a new vehicle plant …
… the animal and plant life …
… the passion-fruit plant …

Classifier 1 and Classifier 2 then label new examples, e.g.:
… plant#1 growth is retarded …
… a nuclear power plant#2 …


Slide 33

Co-training / Self-training

Given:
- a set L of labeled training examples
- a set U of unlabeled examples
- classifiers Ci

1. Create a pool of examples U': choose P random examples from U.
2. Loop for I iterations:
   - Train Ci on L and label U'.
   - Select the G most confident examples and add them to L (maintaining the class distribution in L).
   - Refill U' with examples from U (keeping U' at a constant size P).
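A minimal sketch of the self-training variant of this loop, in scikit-learn style; the parameter names mirror the recipe above, and maintaining the class distribution in L is omitted for brevity:

```python
import random

def self_train(clf, L_X, L_y, U, P=100, G=10, iterations=20):
    pool = random.sample(U, min(P, len(U)))               # the pool U'
    U = [u for u in U if u not in pool]
    for _ in range(iterations):
        clf.fit(L_X, L_y)                                 # train on L, then label U'
        pool.sort(key=lambda x: max(clf.predict_proba([x])[0]), reverse=True)
        confident, pool = pool[:G], pool[G:]              # G most confident examples
        L_X += confident
        L_y += [clf.predict([x])[0] for x in confident]   # add them to L
        refill = random.sample(U, min(G, len(U)))         # keep U' near size P
        U = [u for u in U if u not in refill]
        pool += refill
    return clf.fit(L_X, L_y)
```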


Slide 34

Co-training

(Blum and Mitchell, 1998)
- Two classifiers
- Independent views [the independence condition can be relaxed]

Co-training in Natural Language Learning:
- Statistical parsing (Sarkar, 2001)
- Co-reference resolution (Ng and Cardie, 2003)
- Part-of-speech tagging (Clark, Curran and Osborne, 2003)
- …


Slide 35

Self-training

(Nigam and Ghani, 2000)
- One single classifier
- Retrained on its own output

Self-training for Natural Language Learning:
- Part-of-speech tagging (Clark, Curran and Osborne, 2003)
- Co-reference resolution (Ng and Cardie, 2003): several classifiers through bagging