Post on 19-Dec-2015
1
Discussion of assigned readings Lecture 13
2
How does ‘smoothing’ help in Bayesian word sense disambiguation? How do you do this smoothing?
– Most words appear rarely (remember Heaps' law)
The more data you see, the more previously unseen words you encounter
– Now imagine you use the word before and the word after the target word as a feature
– When you test your classifier, you will see context words you never saw during training
3
How does the naïve Bayes classifier work?
Compute the probability of the observed context assuming each sense
– Multiplication of the probabilities of each individual feature
– A word not seen in training will have zero probability
Choose the most probable sense
4
Lesson 2: zeros or not?
Zipf's Law:
– A small number of events occur with high frequency
– A large number of events occur with low frequency
– You can quickly collect statistics on the high frequency events
– You might have to wait an arbitrarily long time to get valid statistics on low frequency events
Result:
– Our estimates are sparse! No counts at all for the vast bulk of things we want to estimate!
– Some of the zeros in the table are really zeros. But others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
How to address? Answer:
– Estimate the likelihood of unseen N-grams!
5
Dealing with unknown words
Training:
– Assume a fixed vocabulary (e.g. all words that occur at least 5 times in the corpus)
– Replace all other words by a token <UNK>
– Estimate the model on this corpus
Testing:
– Replace all unknown words by <UNK>
– Run the model
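A minimal sketch of this train/test recipe (the count threshold of 5 comes from the slide; the toy example below lowers it only to keep the data small):

```python
from collections import Counter

def build_vocab(corpus_tokens, min_count=5):
    """Fixed vocabulary: every word seen at least min_count times in training."""
    counts = Counter(corpus_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_unknown(tokens, vocab):
    """Map every out-of-vocabulary token to the <UNK> symbol."""
    return [w if w in vocab else "<UNK>" for w in tokens]
```

At test time the same `replace_unknown` is applied before running the model, so unseen words fall into the <UNK> probability mass estimated during training.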
6
Smoothing is like Robin Hood:Steal from the rich and give to the poor (in probability mass)
7
Laplace smoothing
Also called add-one smoothing
Just add one to all the counts! Very simple
MLE estimate:
Laplace estimate:
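The two estimates, sketched for unigrams (V must be the full vocabulary size, including the word types you are prepared to see at test time):

```python
from collections import Counter

def laplace_unigram(corpus_tokens, vocab_size):
    """Unigram estimates over a corpus of N tokens:
    MLE:     P(w) = c(w) / N
    Laplace: P(w) = (c(w) + 1) / (N + V)   (add one to every count)
    """
    counts = Counter(corpus_tokens)
    n = len(corpus_tokens)
    def prob(w):
        return (counts[w] + 1) / (n + vocab_size)
    return prob
```

Unseen words now get a small nonzero probability instead of zero, which is exactly what the naive Bayes multiplication needs.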
8
Why do pseudo-words give optimistic results for WSD?
Banana-door
– Find sentences containing either of the words
– Replace the occurrence of each of the words with a new symbol
– Here is a big annotated corpus!
– The correct sense is the original word
The problem is that different senses of the same word are often semantically related
– Pseudo-words are homonymous rather than polysemous
9
Let’s correct this
Which is better, lexical sample or decision tree classifier, in terms of system performance and results?
Lexical sample task
– Way of formulating the task of WSD
– A small pre-selected set of target words
Classifiers used to solve the task
– Naïve Bayes
– Decision trees
– Support vector machines
10
What is relative entropy?
KL divergence/relative entropy
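Relative entropy between two distributions p and q, written D(p || q), measures how badly q approximates p; a direct sketch:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)).
    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)
```

It is zero only when the two distributions agree, and it is not symmetric: D(p || q) differs from D(q || p) in general.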
11
Selectional preference strength
eat x: [FOOD]
be y: [PRETTY-MUCH-ANYTHING]
The distribution of expected semantic classes (FOOD, PEOPLE, LIQUIDS)
The distribution of expected semantic classes for a particular verb
The greater the difference between these distributions, the more information the verb is giving us about possible objects
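Following Resnik, the "difference between these distributions" is measured with relative entropy, so the selectional preference strength of a verb $v$ can be written as:

```latex
S_R(v) \;=\; D\big(P(c \mid v)\,\big\|\,P(c)\big)
       \;=\; \sum_{c} P(c \mid v)\,\log\frac{P(c \mid v)}{P(c)}
```

A verb like eat, whose objects are concentrated in FOOD, has high $S_R$; a verb like be, which accepts almost anything, has $S_R$ near zero.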
12
Bootstrapping WSD algorithm
After defining plant under life and manufacturing, how can the algorithm go on to classify animal and microscopic under life? Does it do that recursively? Basically, how do other words get labeled besides those we set out to label?
13
Bootstrapping
What if you don’t have enough data to train a system…
Bootstrap
– Pick a word that you as an analyst think will co-occur with your target word in a particular sense
– Grep through your corpus for your target word and the hypothesized word
– Assume that the target tag is the right one
For bass
– Assume play occurs with the music sense and fish occurs with the fish sense
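A sketch of one bootstrap pass for bass with the seed collocates from the slide (play for the music sense, fish for the fish sense); the sentences are invented for illustration:

```python
def bootstrap_label(sentences, target, seeds):
    """Label each sentence containing `target` with the sense whose seed
    word also occurs in it; sentences matching no (or several) seeds
    stay unlabeled for a later pass."""
    labeled, unlabeled = [], []
    for sent in sentences:
        words = sent.lower().split()
        if target not in words:
            continue
        senses = [sense for sense, seed in seeds.items() if seed in words]
        if len(senses) == 1:
            labeled.append((sent, senses[0]))
        else:
            unlabeled.append(sent)
    return labeled, unlabeled

seeds = {"music": "play", "fish": "fish"}
sentences = ["he loves to play the bass",
             "we caught a bass and another fish",
             "the bass was loud"]
labeled, unlabeled = bootstrap_label(sentences, "bass", seeds)
```

A classifier trained on the labeled sentences can then label the leftovers, add the confident ones to the training set, and repeat; that is how words you never seeded end up labeled.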
14
Sentences extracted using “fish” and “play”
15
16
SimLin: why multiply by 2?
– Lin: sim(A,B) = log P(common(A,B)) / log P(description(A,B))
– Describing both A and B puts log P(c1) + log P(c2) in the denominator, and the shared information log P(LCS(c1,c2)) is counted once for each, hence the 2 in the numerator
17
Word similarity
Synonymy is a binary relation
– Two words are either synonymous or not
We want a looser metric
– Word similarity or
– Word distance
Two words are more similar
– If they share more features of meaning
Actually these are really relations between senses:
– Instead of saying “bank is like fund”
– We say
Bank1 is similar to fund3
Bank2 is similar to slope5
We’ll compute them over both words and senses
18
Two classes of algorithms
Thesaurus-based algorithms
– Based on whether words are “nearby” in WordNet
Distributional algorithms
– By comparing words based on their contexts
I like having X for dinner. What are the possible values of X?
19
Thesaurus-based word similarity
We could use anything in the thesaurus
– Meronymy
– Glosses
– Example sentences
In practice
– By “thesaurus-based” we just mean using the is-a/subsumption/hypernym hierarchy
Word similarity versus word relatedness
– Similar words are near-synonyms
– Related could be related any way
Car, gasoline: related, not similar
Car, bicycle: similar
20
Path based similarity
Two words are similar if nearby in thesaurus hierarchy (i.e. short path between them)
21
Refinements to path-based similarity
pathlen(c1,c2) = number of edges in the shortest path between the sense nodes c1 and c2
sim_path(c1,c2) = -log pathlen(c1,c2)
wordsim(w1,w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim_path(c1,c2)
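A sketch over a toy hypernym graph (the edges below are invented for illustration, not real WordNet data):

```python
import math
from collections import deque

# Toy is-a edges: child -> parent (illustrative only).
PARENT = {"nickel": "coin", "dime": "coin", "coin": "currency",
          "currency": "money", "budget": "money"}

def pathlen(c1, c2):
    """Number of edges on the shortest path between two sense nodes,
    treating hierarchy edges as undirected."""
    adj = {}
    for child, par in PARENT.items():
        adj.setdefault(child, set()).add(par)
        adj.setdefault(par, set()).add(child)
    seen, frontier = {c1}, deque([(c1, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == c2:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # not connected

def sim_path(c1, c2):
    """sim_path(c1,c2) = -log pathlen(c1,c2), as on the slide."""
    return -math.log(pathlen(c1, c2))
```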
22
Problem with basic path-based similarity
Assumes each link represents a uniform distance
Nickel to money seems closer than nickel to standard
Instead:
– Want a metric which lets us represent the cost of each edge independently
23
Information content similarity metrics
Let’s define P(c) as:
– The probability that a randomly selected word in a corpus is an instance of concept c
– Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
– P(root) = 1
– The lower a node in the hierarchy, the lower its probability
24
Information content similarity
Train by counting in a corpus
– 1 instance of “dime” could count toward the frequency of coin, currency, standard, etc.
More formally:
P(c) = Σ_{w ∈ words(c)} count(w) / N
25
Information content similarity
WordNet hierarchy augmented with probabilities P(c)
26
Information content: definitions
Information content:
– IC(c) = -log P(c)
Lowest common subsumer LCS(c1,c2)
– I.e. the lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
27
Resnik method
The similarity between two words is related to their common information
The more two words have in common, the more similar they are
Resnik: measure the common information as:
– The info content of the lowest common subsumer of the two nodes
– sim_resnik(c1,c2) = -log P(LCS(c1,c2))
28
SimLin(c1,c2) = 2 × log P(LCS(c1,c2)) / (log P(c1) + log P(c2))
SimLin(hill, coast) = 2 × log P(geological-formation) / (log P(hill) + log P(coast)) = .59
29
Extended Lesk
Two concepts are similar if their glosses contain similar words
– Drawing paper: paper that is specially prepared for use in drafting
– Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
For each n-word phrase that occurs in both glosses
– Add a score of n²
– “paper” and “specially prepared”: 1 + 4 = 5
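A sketch of the overlap score: find the maximal phrases shared by the two glosses and add n² for each (reproducing the 1 + 4 = 5 for these glosses):

```python
def lesk_overlap(gloss1, gloss2):
    """Sum of n**2 over maximal n-word phrases common to both glosses."""
    t1, t2 = gloss1.lower().split(), gloss2.lower().split()

    def phrases(toks):
        return {tuple(toks[i:i + n])
                for n in range(1, len(toks) + 1)
                for i in range(len(toks) - n + 1)}

    common = phrases(t1) & phrases(t2)

    def inside(small, big):
        return len(small) < len(big) and any(
            big[i:i + len(small)] == small
            for i in range(len(big) - len(small) + 1))

    # Keep only phrases not contained in a longer common phrase.
    maximal = [p for p in common if not any(inside(p, q) for q in common)]
    return sum(len(p) ** 2 for p in maximal)
```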
30
Summary: thesaurus-based similarity
31
Problems with thesaurus-based methods
We don’t have a thesaurus for every language
Even if we do, many words are missing
They rely on hyponym info:
– Strong for nouns, but lacking for adjectives and even verbs
Alternative
– Distributional methods for word similarity
32
Distributional methods for word similarity
– A bottle of tezgüino is on the table
– Everybody likes tezgüino
– Tezgüino makes you drunk
– We make tezgüino out of corn
Intuition:
– Just from these contexts a human could guess the meaning of tezgüino
– So we should look at the surrounding contexts, and see what other words have similar contexts
33
Context vector
Consider a target word w
Suppose we had one binary feature f_i for each of the N words v_i in the lexicon
Which means “word v_i occurs in the neighborhood of w”
w = (f1, f2, f3, …, fN)
If w = tezgüino: v1 = bottle, v2 = drunk, v3 = matrix: w = (1, 1, 0, …)
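The binary vector can be built directly from the tezgüino sentences on the previous slide; bottle, drunk, matrix stand in for the full N-word lexicon:

```python
def context_vector(target, sentences, lexicon):
    """f_i = 1 iff lexicon word v_i occurs in the neighborhood of the
    target (here: anywhere in a sentence containing the target)."""
    neighborhood = set()
    for sent in sentences:
        words = sent.lower().split()
        if target in words:
            neighborhood.update(words)
    return [1 if v in neighborhood else 0 for v in lexicon]

sentences = ["a bottle of tezguino is on the table",
             "everybody likes tezguino",
             "tezguino makes you drunk",
             "we make tezguino out of corn"]
vec = context_vector("tezguino", sentences, ["bottle", "drunk", "matrix"])
# vec == [1, 1, 0]
```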
34
Intuition
Define two words by these sparse feature vectors
Apply a vector distance metric
Say that two words are similar if their vectors are similar
35
Distributional similarity
So we just need to specify 3 things
1. How the co-occurrence terms are defined
2. How terms are weighted (frequency? Logs? Mutual information?)
3. What vector distance metric should we use? Cosine? Euclidean distance?
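For choice 3, the cosine of two context vectors is a common pick, since it ignores overall frequency and compares only direction:

```python
import math

def cosine(v1, v2):
    """cos(v1, v2) = dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```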
36
Defining co-occurrence vectors
He drinks X every morning
Idea: parse the sentence, extract syntactic dependencies:
37
Co-occurrence vectors based on dependencies
38
Measures of association with context
We have been using the frequency of some feature as its weight or value
But we could use any function of this frequency
Let’s consider one feature f = (r, w') = (obj-of, attack)
P(f|w) = count(f,w) / count(w)
Assoc_prob(w,f) = P(f|w)
39
Weighting: Mutual Information
Pointwise mutual information: measure of how often two events x and y occur, compared with what we would expect if they were independent:
PMI between a target word w and a feature f :
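Sketched from counts (n is the total number of observations; the example counts in the test are invented):

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated as counts over n observations."""
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))
```

PMI is 0 when x and y co-occur exactly as often as independence predicts, and positive when they co-occur more often than that.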
40
Mutual information intuition
Objects of the verb drink
41
Lin is a variant on PMI
Pointwise mutual information: how often two events x and y occur, compared with what we would expect if they were independent:
PMI between a target word w and a feature f :
Lin measure: breaks down expected value for P(f) differently:
42
Similarity measures
43
What is the baseline algorithm (Lin & Hovy paper)?
Good question, they don’t say!
– Random selection
– First N words
44
TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages
45
Unsupervised Discourse Segmentation
Hearst (1997): 21-paragraph science news article called “Stargazers”
Goal: produce the following subtopic segments:
46
Intuition of cohesion-based segmentation
Sentences or paragraphs in a subtopic are cohesive with each other
But not with paragraphs in a neighboring subtopic
Thus if we measured the cohesion between every pair of neighboring sentences
– We might expect a ‘dip’ in cohesion at subtopic boundaries.
47
TextTiling (Hearst 1997)
1. Tokenization
– Each space-delimited word
– Converted to lower case
– Throw out stop list words
– Stem the rest
– Group into pseudo-sentences of length w = 20
2. Lexical score determination: cohesion score
– Average similarity (cosine measure) between blocks of pseudo-sentences on either side of each gap
3. Boundary Identification
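A much-simplified sketch of steps 2 and 3: count words in the blocks on either side of each gap and take the cosine; Hearst's actual algorithm also uses stemming, stop-word removal, w = 20 pseudo-sentences, and smoothed depth scores, all omitted here:

```python
import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def gap_scores(pseudo_sentences, block=2):
    """Cohesion at each gap: cosine between the word counts of the blocks
    before and after it. Dips suggest subtopic boundaries."""
    scores = []
    for gap in range(1, len(pseudo_sentences)):
        left = Counter(w for s in pseudo_sentences[max(0, gap - block):gap] for w in s)
        right = Counter(w for s in pseudo_sentences[gap:gap + block] for w in s)
        scores.append(cosine(left, right))
    return scores

# Two-topic toy text: cohesion dips to 0 at the topic change (gap 2).
pseudo = [["star", "galaxy"], ["star", "planet"],
          ["court", "judge"], ["court", "law"]]
scores = gap_scores(pseudo)
```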
48
TextTiling algorithm
49
Cosine
50
Could you use stemming to compare synsets (distances between words)?
E.g. musician, music
Is there a way to deal with inconsistent granularities of relations?
51
Zipf's law and Heaps' law
Both are related to the fact that there are a lot of words that one would see very rarely
This means that when we build language models (estimate probabilities of words) we will have unreliable estimates for many– This is why we were talking about smoothing!
52
Heaps’ law: estimating the number of terms
M = kT^b
M: vocabulary size (number of terms)
T: number of tokens
30 < k < 100
b ≈ 0.5
Linear relation between vocabulary size and number of tokens in log-log space
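Plugging in illustrative values from the stated ranges (k = 44 and b = 0.5 are assumptions for the example, not fitted constants):

```python
def heaps_vocab_size(num_tokens, k=44, b=0.5):
    """Heaps' law: M = k * T**b, the expected vocabulary size
    for a corpus of T tokens."""
    return k * num_tokens ** b
```

Under these parameters a 1,000,000-token corpus would be expected to contain roughly 44,000 distinct terms.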
53
54
Zipf’s law: modeling the distribution of terms
The collection frequency cf_i of the i-th most common term is proportional to 1/i
If the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
cf_i ∝ 1/i
cf_i = c · i^k, with k = -1
log cf_i = log c + k log i
55
56
Bigram Model
Approximate
– P(unicorn | the mythical) by P(unicorn | mythical)
Markov assumption: the probability of a word depends only on a limited history
Generalization: the probability of a word depends only on the n previous words
– trigrams, 4-grams, …
– the higher n is, the more data is needed to train
– backoff models…
P(w_n | w_1 … w_{n-1}) ≈ P(w_n | w_{n-1})
57
A Simple Example: bigram model
– P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
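The chain above, sketched as an MLE bigram model (probabilities come from raw counts, so unseen bigrams still get zero; the smoothing from the earlier slides would fix that):

```python
from collections import Counter

def train_bigram(sentences):
    """P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1}),
    with <start>/<end> markers around each sentence."""
    bigrams, unigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<start>"] + sent.split() + ["<end>"]
        for prev, cur in zip(toks, toks[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    def prob(cur, prev):
        return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

def sentence_prob(prob, sentence):
    """Multiply the bigram probabilities along the sentence."""
    toks = ["<start>"] + sentence.split() + ["<end>"]
    p = 1.0
    for prev, cur in zip(toks, toks[1:]):
        p *= prob(cur, prev)
    return p
```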
58
Generating WSJ
59
Maximum tf normalization: wasn’t useful for summarization, why discuss it then?
60
Accuracy, precision and recall
61
Accuracy
Problematic measure for IR evaluation
– (tp+tn)/(tp+tn+fp+fn)
99.9% of the documents will be nonrelevant
– A system that marks everything nonrelevant trivially achieves high accuracy
62
Precision
63
Recall
64
Precision/Recall trade off
Which is more important depends on the user’s needs
– Typical web users: high precision in the first page of results
– Paralegals and intelligence analysts: need high recall; willing to tolerate some irrelevant documents as the price
65
F-measure
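The three measures from the confusion-matrix counts, with F as the weighted harmonic mean of precision and recall (beta = 1 gives the balanced F1):

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved."""
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean: (1 + beta^2) * p * r / (beta^2 * p + r)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```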
66
Explain what vector representation means? I tried explaining it to my mother and had difficulty as to what this "vector" implies. She kept saying she thought of vectors in terms of graphs.
67
How can NLP be used in spam?
68
Using the entire string as a feature? No!
Topic categorization: classify the document into semantic topics
The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.
One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.
69
What is relative entropy?
KL divergence/relative entropy
70
Selectional preference strength
How do you find verbs that are strongly associated with a given subject?
eat x: [FOOD]
be y: [PRETTY-MUCH-ANYTHING]
The distribution of expected semantic classes (FOOD, PEOPLE, LIQUIDS)
The distribution of expected semantic classes for a particular verb
The greater the difference between these distributions, the more information the verb is giving us about possible objects