
1

Discussion of assigned readings Lecture 13


2

How does ‘smoothing’ help in Bayesian word sense disambiguation? How do you do this smoothing?
– Most words appear rarely (remember Heaps' law): the more data you see, the more words you encounter that you have never seen before
– Now imagine you use the word before and the word after the target word as features
– When you test your classifier, you will see context words you never saw during training


3

How does the naïve Bayes classifier work?

Compute the probability of the observed context assuming each sense
– Multiply the probabilities of the individual features
– A word not seen in training will have zero probability, which zeroes out the whole product

Choose the most probable sense
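To make the zero-probability problem concrete, here is a minimal Python sketch of this scoring; the sense priors, counts, and vocabulary size are invented toy values, not figures from the readings. With alpha=0 a single unseen context word wipes out a sense's whole product; alpha=1 is the Laplace smoothing discussed below.

import math
from collections import defaultdict

def nb_sense_scores(context_words, sense_priors, feature_counts, vocab_size, alpha=1.0):
    """Log-score each sense of a target word given its bag-of-words context.
    feature_counts[sense][word] = how often `word` appeared near the target
    in training when the target was used in `sense`."""
    scores = {}
    for sense, prior in sense_priors.items():
        counts = feature_counts[sense]
        total = sum(counts.values())
        log_score = math.log(prior)
        for w in context_words:
            p = (counts.get(w, 0) + alpha) / (total + alpha * vocab_size)
            if p == 0.0:                    # only happens when alpha == 0
                log_score = float("-inf")   # one unseen word kills the sense
                break
            log_score += math.log(p)
        scores[sense] = log_score
    return scores

# Toy counts for the two senses of "bass"
priors = {"fish": 0.4, "music": 0.6}
counts = {"fish":  defaultdict(int, {"river": 10, "caught": 7, "boat": 5}),
          "music": defaultdict(int, {"play": 12, "guitar": 9, "band": 6})}
context = ["caught", "river", "guitar"]     # "guitar" never seen with the fish sense
print(nb_sense_scores(context, priors, counts, vocab_size=5000, alpha=0.0))
print(nb_sense_scores(context, priors, counts, vocab_size=5000, alpha=1.0))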


4

Lesson 2: zeros or not?

Zipf’s Law:
– A small number of events occur with high frequency
– A large number of events occur with low frequency
– You can quickly collect statistics on the high frequency events
– You might have to wait an arbitrarily long time to get valid statistics on low frequency events

Result:
– Our estimates are sparse! No counts at all for the vast bulk of things we want to estimate!
– Some of the zeros in the table are really zeros, but others are simply low frequency events you haven't seen yet. After all, anything can happen!
– How to address this?

Answer:
– Estimate the likelihood of unseen N-grams!


5

Dealing with unknown words

Training:
– Assume a fixed vocabulary (e.g. all words that occur at least 5 times in the corpus)
– Replace all other words by a token <UNK>
– Estimate the model on this corpus

Testing:
– Replace all unknown words by <UNK>
– Run the model
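A small sketch of this recipe, assuming a simple whitespace-tokenized corpus; the names and the threshold are illustrative only.

from collections import Counter

def build_vocab(train_tokens, min_count=5):
    # Keep words seen at least min_count times; everything else will map to <UNK>
    freq = Counter(train_tokens)
    return {w for w, c in freq.items() if c >= min_count}

def map_unk(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

# Estimate the model on map_unk(train_tokens, vocab);
# at test time apply map_unk with the SAME vocab before scoring.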


6

Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass)


7

Laplace smoothing

Also called add-one smoothing
Just add one to all the counts! Very simple

MLE estimate: P_MLE(wi) = c(wi) / N

Laplace estimate: P_Laplace(wi) = ( c(wi) + 1 ) / ( N + V ), where N is the total number of tokens and V is the vocabulary size
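A quick sketch of the two unigram estimates on a toy sentence; nothing here comes from the readings, and V is simply the number of distinct training words.

from collections import Counter

def unigram_estimates(tokens):
    counts = Counter(tokens)
    N, V = sum(counts.values()), len(counts)
    mle = {w: c / N for w, c in counts.items()}                  # P_MLE(w) = c(w) / N
    laplace = {w: (c + 1) / (N + V) for w, c in counts.items()}  # (c(w) + 1) / (N + V)
    p_unseen = 1 / (N + V)   # Laplace probability of a word never seen in training
    return mle, laplace, p_unseen

mle, lap, p_unseen = unigram_estimates("the cat sat on the mat".split())
print(mle["the"], lap["the"], p_unseen)   # about 0.33, 0.27, 0.09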


8

Why do pseudo-words give optimistic results for WSD?

Banana-door
– Find sentences containing either of the words
– Replace the occurrence of each of the words with a new symbol
– Here is a big annotated corpus!
– The correct sense is the original word

The problem is that different senses of the same word are often semantically related
– Pseudo-words are homonymous rather than polysemous


9

Let’s correct this

Which is better, lexical sample or decision tree classifier, in terms of system performance and results?

Lexical sample task
– A way of formulating the WSD task
– A small pre-selected set of target words

Classifiers used to solve the task
– Naïve Bayes
– Decision trees
– Support vector machines


10

What is relative entropy?

KL divergence/relative entropy
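The formula itself does not survive in this transcript, so for reference (the standard definition, not text copied from the slide): for two distributions p and q over the same outcomes,

D(p || q) = Σx p(x) log [ p(x) / q(x) ]

It is zero when p = q, grows as the distributions diverge, and is not symmetric in p and q.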


11

Selectional preference strength

eat x [FOOD]
be y [PRETTY-MUCH-ANYTHING]

The distribution of expected semantic classes (FOOD, PEOPLE, LIQUIDS)

The distribution of expected semantic classes for a particular verb

The greater the difference between these distributions, the more information the verb is giving us about possible objects
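One standard way to quantify "the difference between these distributions" (an addition here, not text from the slide) is Resnik's selectional preference strength, the relative entropy between them:

S(v) = D( P(c | v) || P(c) ) = Σc P(c | v) log [ P(c | v) / P(c) ]

It is large for a verb like eat, whose objects are sharply constrained, and near zero for a verb like be.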


12

Bootstrapping WSD algorithm

How can the algorithm, after defining plant under life and manufacturing, go on to define animal and microscopic from life? Does it do that recursively or something? Basically, how do other words get labeled other than those we set out to label?


13

Bootstrapping

What if you don’t have enough data to train a system…

Bootstrap
– Pick a word that you as an analyst think will co-occur with your target word in a particular sense
– Grep through your corpus for your target word and the hypothesized word
– Assume that the target tag is the right one

For bass
– Assume play occurs with the music sense and fish occurs with the fish sense


14

Sentences extracted using “fish” and “play”


15


16

SimLin: why multiply by 2

sim(A, B) = Common(A, B) / Description(A, B) (Lin's similarity theorem: similarity is the ratio of the information in what A and B have in common to the information needed to describe both of them)


17

Word similarity

Synonymy is a binary relation
– Two words are either synonymous or not

We want a looser metric
– Word similarity or
– Word distance

Two words are more similar
– If they share more features of meaning

Actually these are really relations between senses:
– Instead of saying “bank is like fund”
– We say
  Bank1 is similar to fund3
  Bank2 is similar to slope5

We’ll compute them over both words and senses


18

Two classes of algorithms

Thesaurus-based algorithms
– Based on whether words are “nearby” in WordNet

Distributional algorithms
– By comparing words based on their contexts

“I like having X for dinner”: what are the possible values of X?


19

Thesaurus-based word similarity

We could use anything in the thesaurus
– Meronymy
– Glosses
– Example sentences

In practice
– By “thesaurus-based” we just mean using the is-a/subsumption/hypernym hierarchy

Word similarity versus word relatedness
– Similar words are near-synonyms
– Related words could be related in any way
  Car, gasoline: related, not similar
  Car, bicycle: similar


20

Path based similarity

Two words are similar if nearby in thesaurus hierarchy (i.e. short path between them)


21

Refinements to path-based similarity

pathlen(c1,c2) = number of edges in the shortest path between the sense nodes c1 and c2

simpath(c1,c2) = -log pathlen(c1,c2)

wordsim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1, c2)
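A small self-contained sketch of these definitions over an invented toy hypernym graph; the node names and links are illustrative, not WordNet data. wordsim would simply take the maximum of sim over all sense pairs of the two words.

import math
from collections import deque

# Toy is-a hierarchy: child -> list of parents (invented for illustration)
HYPERNYMS = {
    "nickel": ["coin"], "dime": ["coin"], "coin": ["currency"],
    "currency": ["money"], "fund": ["money"],
    "money": ["medium_of_exchange"], "medium_of_exchange": ["standard"],
    "standard": ["entity"],
}

def pathlen(c1, c2):
    """Number of edges on the shortest path between two sense nodes,
    moving up or down the hierarchy (breadth-first search)."""
    edges = {}
    for child, parents in HYPERNYMS.items():
        for p in parents:
            edges.setdefault(child, set()).add(p)
            edges.setdefault(p, set()).add(child)
    seen, frontier = {c1}, deque([(c1, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == c2:
            return dist
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

def sim_path(c1, c2):
    return -math.log(pathlen(c1, c2))   # the slide's definition (for c1 != c2)

print(pathlen("nickel", "money"), pathlen("nickel", "standard"))      # 3 5
print(sim_path("nickel", "money") > sim_path("nickel", "standard"))   # True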


22

Problem with basic path-based similarity

Assumes each link represents a uniform distance

Nickel to money seems closer than nickel to standard

Instead:
– We want a metric that lets us represent the cost of each edge independently


23

Information content similarity metrics

Let’s define P(c) as:
– The probability that a randomly selected word in a corpus is an instance of concept c
– Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
– P(root) = 1
– The lower a node is in the hierarchy, the lower its probability


24

Information content similarity

Train by counting in a corpus
– One instance of “dime” could count toward the frequency of coin, currency, standard, etc.

More formally:

P(c) = ( Σ w ∈ words(c) count(w) ) / N

where words(c) is the set of words subsumed by concept c and N is the total number of word tokens in the corpus


25

Information content similarity

WordNet hierarchy augmented with probabilities P(c)


26

Information content: definitions

Information content:
– IC(c) = -log P(c)

Lowest common subsumer LCS(c1, c2):
– The lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2


27

Resnik method

The similarity between two words is related to their common information

The more two words have in common, the more similar they are

Resnik: measure the common information as:
– The information content of the lowest common subsumer of the two nodes
– simresnik(c1, c2) = -log P(LCS(c1, c2))


28

SimLin(c1, c2) = 2 × log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )

SimLin(hill, coast) = 2 × log P(geological-formation) / ( log P(hill) + log P(coast) ) = 0.59
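A tiny sketch of these formulas, assuming the concept probabilities usually quoted for this hill/coast example; treat them as illustrative stand-ins rather than values computed from a corpus here.

import math

P = {                      # assumed P(c) values; the root of the hierarchy would have P = 1
    "geological-formation": 0.00176,
    "hill": 0.0000189,
    "coast": 0.0000216,
}

def IC(c):                                  # information content IC(c) = -log P(c)
    return -math.log(P[c])

def sim_resnik(lcs):                        # -log P(LCS(c1, c2)); LCS supplied by the caller
    return IC(lcs)

def sim_lin(c1, c2, lcs):                   # 2 log P(LCS) / (log P(c1) + log P(c2))
    return 2 * math.log(P[lcs]) / (math.log(P[c1]) + math.log(P[c2]))

print(round(sim_lin("hill", "coast", "geological-formation"), 2))   # 0.59, as on the slide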


29

Extended Lesk

Two concepts are similar if their glosses contain similar words
– Drawing paper: paper that is specially prepared for use in drafting
– Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface

For each n-word phrase that occurs in both glosses
– Add a score of n²
– “paper” and “specially prepared” give 1 + 4 = 5 (see the sketch below)
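A rough sketch of the overlap scoring on the two glosses above, using greedy longest-phrase matching; the full Extended Lesk algorithm also scores glosses of related senses, which is omitted here.

def gloss_overlap(gloss1, gloss2):
    """Sum n*n over maximal word sequences shared by the two glosses."""
    a, b = gloss1.lower().split(), gloss2.lower().split()
    used_a, used_b, score = set(), set(), 0
    for n in range(min(len(a), len(b)), 0, -1):     # longest phrases first
        for i in range(len(a) - n + 1):
            for j in range(len(b) - n + 1):
                if (a[i:i + n] == b[j:j + n]
                        and not set(range(i, i + n)) & used_a
                        and not set(range(j, j + n)) & used_b):
                    score += n * n
                    used_a |= set(range(i, i + n))
                    used_b |= set(range(j, j + n))
    return score

g1 = "paper that is specially prepared for use in drafting"
g2 = "the art of transferring designs from specially prepared paper to a wood or glass or metal surface"
print(gloss_overlap(g1, g2))   # "specially prepared" (4) + "paper" (1) = 5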


30

Summary: thesaurus-based similarity


31

Problems with thesaurus-based methods

We don’t have a thesaurus for every language

Even if we do, many words are missing

They rely on hyponym info:
– Strong for nouns, but lacking for adjectives and even verbs

Alternative
– Distributional methods for word similarity


32

Distributional methods for word similarity

– A bottle of tezgüino is on the table
– Everybody likes tezgüino
– Tezgüino makes you drunk
– We make tezgüino out of corn

Intuition:
– Just from these contexts a human could guess the meaning of tezgüino
– So we should look at the surrounding contexts and see what other words have similar contexts


33

Context vector

Consider a target word w

Suppose we had one binary feature fi for each of the N words vi in the lexicon
– fi means “word vi occurs in the neighborhood of w”

w = (f1, f2, f3, …, fN)

If w = tezgüino, v1 = bottle, v2 = drunk, v3 = matrix: w = (1, 1, 0, …)
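A minimal sketch of building such a binary context vector from the toy sentences above; the window size and everything else here are illustrative choices.

def context_vector(target, sentences, vocab, window=3):
    """vec[i] = 1 if vocab[i] occurs within `window` words of `target` anywhere."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0] * len(vocab)
    for sent in sentences:
        words = sent.lower().split()
        for pos, w in enumerate(words):
            if w == target:
                for neighbor in words[max(0, pos - window):pos + window + 1]:
                    if neighbor in index:
                        vec[index[neighbor]] = 1
    return vec

corpus = ["a bottle of tezguino is on the table",
          "tezguino makes you drunk"]
print(context_vector("tezguino", corpus, ["bottle", "drunk", "matrix"]))   # [1, 1, 0]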


34

Intuition

Define two words by these sparse feature vectors

Apply a vector distance metric

Say that two words are similar if their two vectors are similar


35

Distributional similarity

So we just need to specify 3 things

1. How the co-occurrence terms are defined

2. How terms are weighted (frequency? Logs? Mutual information?)

3. What vector distance metric should we use? Cosine? Euclidean distance?


36

Defining co-occurrence vectors

He drinks X every morning

Idea: parse the sentence, extract syntactic dependencies:


37

Co-occurrence vectors based on dependencies


38

Measures of association with context

We have been using the frequency of some feature as its weight or value

But we could use any function of this frequency

Let’s consider one feature f = (r, w′) = (obj-of, attack)

P(f | w) = count(f, w) / count(w)

assocprob(w, f) = P(f | w)


39

Weighting: Mutual Information

Pointwise mutual information: a measure of how often two events x and y occur together, compared with what we would expect if they were independent:

PMI(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ]

PMI between a target word w and a feature f:

assocPMI(w, f) = log2 [ P(w, f) / ( P(w) P(f) ) ]
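A small numeric sketch of this association score from raw counts; the counts are made up, and in practice they would come from parsed corpus data like the (obj-of, attack) feature above.

import math

def pmi(count_wf, count_w, count_f, total):
    """PMI(w, f) = log2( P(w, f) / (P(w) P(f)) ), estimated from counts."""
    p_wf = count_wf / total
    p_w, p_f = count_w / total, count_f / total
    return math.log2(p_wf / (p_w * p_f))

print(pmi(count_wf=30, count_w=1000, count_f=600, total=1_000_000))   # about 5.6 bits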


40

Mutual information intuition

Objects of the verb drink


41

Lin is a variant on PMI

Pointwise mutual information: how often two events x and y occur, compared with what we would expect if they were independent:

PMI between a target word w and a feature f:

Lin measure: breaks down expected value for P(f) differently:


42

Similarity measures


43

What is the baseline algorithm (Lin & Hovy paper)?

Good question, they don’t say!
– Random selection
– First N words


44

TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages


45

Unsupervised Discourse Segmentation

Hearst (1997): 21-paragraph science news article called “Stargazers”

Goal: produce the following subtopic segments:


46

Intuition of cohesion-based segmentation

Sentences or paragraphs in a subtopic are cohesive with each other

But not with paragraphs in a neighboring subtopic

Thus if we measured the cohesion between every pair of neighboring sentences
– We might expect a ‘dip’ in cohesion at subtopic boundaries


47

TextTiling (Hearst 1997)

1. Tokenization
– Each space-delimited word
– Converted to lower case
– Throw out stop list words
– Stem the rest
– Group into pseudo-sentences of length w = 20

2. Lexical score determination: cohesion score
– Average similarity (cosine measure) between the blocks of pseudo-sentences on either side of each gap (a rough sketch of steps 1 and 2 follows below)

3. Boundary identification
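A rough sketch of steps 1 and 2, with a toy stop list and no stemming; the block size k, the stop list, and w are placeholders rather than Hearst's exact settings. Step 3 would place boundaries at the deepest dips in these gap scores.

import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for"}

def pseudo_sentences(text, w=20):
    # Step 1: lowercase, drop stop words, group remaining tokens into chunks of w
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]

def cosine(c1, c2):
    dot = sum(c1[t] * c2[t] for t in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def gap_scores(blocks, k=2):
    # Step 2: cohesion at each gap = cosine of the k pseudo-sentences before vs. after it
    scores = []
    for gap in range(1, len(blocks)):
        left = Counter(sum(blocks[max(0, gap - k):gap], []))
        right = Counter(sum(blocks[gap:gap + k], []))
        scores.append(cosine(left, right))
    return scores

# usage: gap_scores(pseudo_sentences(article_text))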


48

TextTiling algorithm


49

Cosine


50

Could you use stemming to compare synsets (distances between words)?

E.g. musician, music

Is there a way to deal with inconsistent granularities of relations?


51

Zipf's law and Heaps' law

Both are related to the fact that there are a lot of words that one sees only very rarely

This means that when we build language models (estimate probabilities of words) we will have unreliable estimates for many of them
– This is why we were talking about smoothing!


52

Heaps' law: estimating the number of terms

M = k T^b

M: vocabulary size (number of terms)
T: number of tokens
30 < k < 100
b ≈ 0.5

Linear relation between vocabulary size and number of tokens in log-log space
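A quick numeric illustration with k = 50 and b = 0.5; these parameter values are chosen inside the ranges on the slide, not measured from any particular corpus.

# Heaps' law: M = k * T**b
for T in (10_000, 1_000_000, 100_000_000):
    print(T, round(50 * T ** 0.5))   # 5,000   50,000   500,000 distinct terms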


53


54

Zipf’s law: modeling the distribution of terms

The collection frequency of the ith most common term is proportional to 1/i

If the most frequent term occurs cf1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.

cfi ∝ 1 / i

Equivalently cfi = c i^k with k = -1, so in log-log space:

log cfi = log c + k log i


55


56

Bigram Model

Approximate
– P(unicorn | the mythical) by P(unicorn | mythical)

Markov assumption: the probability of a word depends only on a limited history

Generalization: the probability of a word depends only on the n previous words
– trigrams, 4-grams, …
– the higher n is, the more data is needed to train
– backoff models…

P(wn | w1 … wn-1) ≈ P(wn | wn-1)


57

A Simple Example: bigram model

– P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
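A minimal sketch of estimating these bigram probabilities by maximum likelihood from a toy two-sentence corpus (no smoothing); <s> and </s> play the roles of <start> and <end>.

from collections import Counter

def bigram_mle(sentences):
    """P(w2 | w1) = count(w1 w2) / count(w1), with sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token can start a bigram except </s>
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = bigram_mle(["I want to eat Chinese food", "I want Chinese food"])
print(p("i", "want"), p("want", "to"), p("<s>", "i"))   # 1.0 0.5 1.0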


58

Generating WSJ


59

Maximum tf normalization: wasn’t useful for summarization, why discuss it then?


60

Accuracy, precision and recall


61

Accuracy

Problematic measure for IR evaluation

– Accuracy = (tp + tn) / (tp + tn + fp + fn)

99.9% of the documents will be nonrelevant
– High accuracy is then trivially achieved by marking everything nonrelevant


62

Precision


63

Recall


64

Precision/Recall trade off

Which is more important depends on the user's needs

– Typical web users: high precision in the first page of results
– Paralegals and intelligence analysts: need high recall, willing to tolerate some irrelevant documents as the price


65

F-measure
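The formulas on these slides are lost in the transcript, so here is a short sketch that puts them together using the standard definitions (precision = tp / (tp + fp), recall = tp / (tp + fn), balanced F-measure F1 = 2PR / (P + R)); the counts in the example are invented to echo the 99.9%-nonrelevant point above.

def ir_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Returning nothing at all: 99.9% accuracy, but recall and F are zero
print(ir_metrics(tp=0, fp=0, fn=1_000, tn=999_000))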


66

Explain what vector representation means? I tried explaining it to my mother and had difficulty as to what this "vector" implies. She kept saying she thought of vectors in terms of graphs.


67

How can NLP be used in spam?


68

Using the entire string as a feature? No!

Topic categorization: classify the document into semantic topics

The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.

One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.


69

What is relative entropy?

KL divergence/relative entropy


70

Selectional preference strength

How do you find verbs that are strongly associated with a given subject?

eat x [FOOD]
be y [PRETTY-MUCH-ANYTHING]

The distribution of expected semantic classes (FOOD, PEOPLE, LIQUIDS)

The distribution of expected semantic classes for a particular verb

The greater the difference between these distributions, the more information the verb is giving us about possible objects