Lecture 24: Distributional Word Similarity II
Topics: distributional word similarity example; PMI; context = syntactic dependencies
Readings: NLTK book Chapter 2 (wordnet); Text Chapter 20
April 15, 2013
CSCE 771 Natural Language Processing
– 2 – CSCE 771 Spring 2013
Overview
Last Time: finish up thesaurus-based similarity …; distributional word similarity
Today (last lecture's slides 21-): distributional word similarity II; syntax-based contexts
Readings: Text 19, 20; NLTK Book: Chapter 10
Next Time: Computational Lexical Semantics II
– 3 – CSCE 771 Spring 2013
Pointwise Mutual Information (PMI)
Mutual information, Church and Hanks 1989 (eq 20.36):
  I(X; Y) = Σ_x Σ_y P(x, y) log2 [ P(x, y) / (P(x) P(y)) ]
Pointwise mutual information, Fano 1961 (eq 20.37):
  PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
assoc-PMI (eq 20.38):
  assoc-PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
– 4 – CSCE 771 Spring 2013
Computing PPMI
Matrix F with W rows (words) and C columns (contexts); f_ij is the frequency of w_i in c_j:
  p_ij = f_ij / Σ_i Σ_j f_ij
  p_i* = Σ_j f_ij / Σ_i Σ_j f_ij      p_*j = Σ_i f_ij / Σ_i Σ_j f_ij
  pmi_ij = log2 [ p_ij / (p_i* p_*j) ]
  ppmi_ij = pmi_ij if pmi_ij > 0, else 0
– 5 – CSCE 771 Spring 2013
Example computing PPMI

             computer  data  pinch  result  salt
apricot          0       0     1      0      1
pineapple        0       0     1      0      1
digital          2       1     0      1      0
information      1       6     0      4      0

Word Similarity: Distributional Similarity I -- NLP, Jurafsky & Manning
p(w=information, c=data) =
p(w=information) =
p(c=data) =
– 6 – CSCE 771 Spring 2013
Example computing PPMI

             computer  data  pinch  result  salt
apricot          0       0     1      0      1
pineapple        0       0     1      0      1
digital          2       1     0      1      0
information      1       6     0      4      0

Word Similarity: Distributional Similarity I -- NLP, Jurafsky & Manning
(total count = 19)
p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37
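The probabilities above can be checked with a short Python 3 sketch over the same count table (the function and variable names are my own, not from the slides):

```python
from math import log2

# Count table from the slide: word -> {context: frequency}
counts = {
    "apricot":     {"computer": 0, "data": 0, "pinch": 1, "result": 0, "salt": 1},
    "pineapple":   {"computer": 0, "data": 0, "pinch": 1, "result": 0, "salt": 1},
    "digital":     {"computer": 2, "data": 1, "pinch": 0, "result": 1, "salt": 0},
    "information": {"computer": 1, "data": 6, "pinch": 0, "result": 4, "salt": 0},
}

total = sum(sum(row.values()) for row in counts.values())            # 19
p_w = {w: sum(row.values()) / total for w, row in counts.items()}    # row marginals
contexts = counts["apricot"].keys()
p_c = {c: sum(counts[w][c] for w in counts) / total for c in contexts}  # column marginals

def ppmi(w, c):
    """Positive PMI: negative values (and zero counts) are clipped to 0."""
    p_wc = counts[w][c] / total
    if p_wc == 0:
        return 0.0
    return max(0.0, log2(p_wc / (p_w[w] * p_c[c])))

print(round(p_w["information"], 2), round(p_c["data"], 2))  # 0.58 0.37
print(round(ppmi("information", "data"), 2))                # 0.57
```

Clipping at zero is what distinguishes PPMI from raw PMI: zero and negative associations (like apricot/computer) all map to 0.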
– 7 – CSCE 771 Spring 2013
Associations
– 8 – CSCE 771 Spring 2013
PMI: More data trumps smarter algorithms
“More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis”, Indiana University, 2009
http://www.indiana.edu/~clcl/Papers/BSC901.pdf
“we demonstrate that this metric
• benefits from training on extremely large amounts of data and
• correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models.”
– 9 – CSCE 771 Spring 2013
Figure 20.10 Co-occurrence vectors based on syntactic dependencies
Dependency-based parser – special case of shallow parsing
Identify from “I discovered dried tangerines.” (20.32):
  discover(subject I)          I(subject-of discover)
  tangerine(obj-of discover)   tangerine(adj-mod dried)
– 10 – CSCE 771 Spring 2013
Defining context using syntactic info
• dependency parsing
• chunking
  discover(subject I)          -- S → NP VP
  I(subject-of discover)
  tangerine(obj-of discover)   -- VP → verb NP
  tangerine(adj-mod dried)     -- NP → det ? ADJ N
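The idea of turning dependency relations into context features can be sketched in a few lines of Python 3 (the triples below are hand-written for the example sentence, standing in for real parser output):

```python
# Hypothetical dependency triples for "I discovered dried tangerines.":
# (dependent, relation, head), as a dependency parser might produce.
triples = [
    ("I", "subject", "discovered"),
    ("tangerines", "object", "discovered"),
    ("dried", "adj-mod", "tangerines"),
]

def syntactic_contexts(triples):
    """Each triple yields two context features: one for the head
    (relation, dependent) and an inverse one for the dependent
    (relation-of, head), mirroring I(subject-of discover)."""
    contexts = {}
    for word, rel, head in triples:
        contexts.setdefault(head, []).append((rel, word))
        contexts.setdefault(word, []).append((rel + "-of", head))
    return contexts

ctx = syntactic_contexts(triples)
print(ctx["discovered"])   # [('subject', 'I'), ('object', 'tangerines')]
print(ctx["tangerines"])   # [('object-of', 'discovered'), ('adj-mod', 'dried')]
```

These (relation, word) pairs then serve as the columns of the co-occurrence matrix, in place of simple bag-of-words contexts.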
– 11 – CSCE 771 Spring 2013
Figure 20.11 Objects of the verb drink, Hindle 1990 ACL
• frequencies: it, much and anything more frequent than wine
• PMI-Assoc: wine more drinkable

Object          Count  PMI-Assoc
tea               4     11.75
Pepsi             2     11.75
champagne         4     11.75
liquid            2     10.53
beer              5     10.20
wine              2      9.34
water             7      7.65
anything          3      5.15
much              3      2.54
it                3      1.25
<some amount>     2      1.22
http://acl.ldc.upenn.edu/P/P90/P90-1034.pdf
– 12 – CSCE 771 Spring 2013
Vectors review
dot product: v · w = Σ_i v_i w_i
length: |v| = sqrt( Σ_i v_i² )
sim-cosine(v, w) = (v · w) / (|v| |w|)
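The three definitions above translate directly into Python 3; as a sketch, here they are applied to rows of the earlier count table (helper names are my own):

```python
from math import sqrt

def dot(v, w):
    """Dot product: sum of elementwise products."""
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    """Euclidean length: sqrt of the vector's dot product with itself."""
    return sqrt(dot(v, v))

def sim_cosine(v, w):
    """Cosine similarity: dot product normalized by both lengths."""
    return dot(v, w) / (length(v) * length(w))

# Rows of the computer/data/pinch/result/salt table as context vectors:
apricot = [0, 0, 1, 0, 1]
digital = [2, 1, 0, 1, 0]
information = [1, 6, 0, 4, 0]
print(round(sim_cosine(digital, information), 3))  # 0.673
print(sim_cosine(apricot, information))            # 0.0, no shared contexts
```

Because raw counts are non-negative, cosine here ranges from 0 (no shared contexts) to 1 (identical direction); length-normalization keeps frequent words from dominating.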
– 13 – CSCE 771 Spring 2013
Figure 20.12 Similarity of Vectors
– 14 – CSCE 771 Spring 2013
Fig 20.13 Vector Similarity Summary
– 15 – CSCE 771 Spring 2013
Figure 20.14 Hand-built patterns for hypernyms, Hearst 1992
Finding hypernyms (IS-A links)
(20.58) One example of red algae is Gelidium.
Pattern: one example of *** is a ***
500,000 hits on Google
Semantic drift in bootstrapping
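A Hearst-style pattern like the one above is essentially a regular expression over surface text. A minimal Python 3 sketch of one such matcher (the regex and function are my own simplification, and real systems match over parse trees, not raw strings):

```python
import re

# One Hearst-style pattern: "one example of <hypernym> is <hyponym>",
# cf. example (20.58). The lazy group stops at the first " is ".
PATTERN = re.compile(r"one example of ([\w ]+?) is (\w+)", re.IGNORECASE)

def find_isa(sentence):
    """Return a (hyponym, 'IS-A', hypernym) triple if the pattern matches."""
    m = PATTERN.search(sentence)
    if m:
        return (m.group(2), "IS-A", m.group(1))
    return None

print(find_isa("One example of red algae is Gelidium."))
# ('Gelidium', 'IS-A', 'red algae')
```

Such high-precision patterns seed bootstrapping, which is exactly where the semantic drift mentioned above creeps in: each newly harvested pair can pull later patterns off-topic.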
– 16 – CSCE 771 Spring 2013
Hyponym Learning Alg. (Snow 2005)
Rely on WordNet to learn large numbers of weak hyponym patterns
Snow's Algorithm:
1. Collect all pairs of WordNet noun concepts with <c_i IS-A c_j>
2. For each pair, collect all sentences containing the pair
3. Parse the sentences and automatically extract every possible Hearst-style syntactic pattern from the parse tree
4. Use the large set of patterns as features in a logistic regression classifier
5. Given each pair, extract features and use the classifier to determine if the pair is a hypernym/hyponym
New patterns learned:
  NPH like NP        NP is a NPH
  NPH called NP      NP, a NPH (appositive)
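Steps 2-4 can be sketched in Python 3: count how often each pattern fires for a noun pair, giving the feature vector that would feed the logistic regression classifier. This is a toy string-matching stand-in (real Snow-style systems extract patterns from parse trees, and the mini-corpus here is invented):

```python
from collections import Counter

# Templates echoing the learned patterns above; index = feature id.
patterns = ["{hypo} is a {hyper}", "{hyper} such as {hypo}",
            "{hypo}, a {hyper}", "{hyper} like {hypo}"]

sentences = [
    "a dog is a mammal",
    "mammals such as dogs are warm-blooded",
    "the sparrow, a bird, flew off",
]

def pattern_features(hypo, hyper, sentences):
    """Feature vector for a candidate (hyponym, hypernym) pair:
    how many sentences instantiate each pattern template."""
    feats = Counter()
    for i, pat in enumerate(patterns):
        needle = pat.format(hypo=hypo, hyper=hyper)
        for s in sentences:
            if needle in s:
                feats[i] += 1
    return feats

print(pattern_features("dog", "mammal", sentences))   # Counter({0: 1})
print(pattern_features("sparrow", "bird", sentences)) # Counter({2: 1})
```

Each known IS-A pair from WordNet provides a positive training example; the classifier then learns how much evidence each individually weak pattern contributes.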
– 17 – CSCE 771 Spring 2013
Vector similarities from Lin 1998
hope (N): optimism 0.141, chance 0.137, expectation 0.137, prospect 0.126, dream 0.119, desire 0.118, fear 0.116, effort 0.111, confidence 0.109, promise 0.108
hope (V): would like 0.158, wish 0.140, …
brief (N): legal brief 0.256, affidavit 0.191, …
brief (A): lengthy 0.256, hour-long 0.191, short 0.174, extended 0.163, …
full lists on page 667
– 18 – CSCE 771 Spring 2013
Supersenses: 26 broad-category “lexicographer class” WordNet labels
– 19 – CSCE 771 Spring 2013
Figure 20.15 Semantic Role Labelling
– 20 – CSCE 771 Spring 2013
Figure 20.16
– 21 – CSCE 771 Spring 2013
Google: “Wordnet NLTK” …
– 22 – CSCE 771 Spring 2013
wn01.py
# Wordnet examples from nltk.googlecode.com
import nltk
from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
types_of_motorcar[26]
print wn.synset('ambulance.n.01')
print sorted([lemma.name for synset in types_of_motorcar
              for lemma in synset.lemmas])

http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
– 23 – CSCE 771 Spring 2013
wn01.py continued
print "wn.synsets('dog', pos=wn.VERB)= ", wn.synsets('dog', pos=wn.VERB)
print wn.synset('dog.n.01')
### Synset('dog.n.01')
print wn.synset('dog.n.01').definition
### 'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
print wn.synset('dog.n.01').examples
### ['the dog barked all night']
– 24 – CSCE 771 Spring 2013
wn01.py continued
print wn.synset('dog.n.01').lemmas
### [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
print [lemma.name for lemma in wn.synset('dog.n.01').lemmas]
### ['dog', 'domestic_dog', 'Canis_familiaris']
print wn.lemma('dog.n.01.dog').synset
– 25 – CSCE 771 Spring 2013
Section 2: synsets, hypernyms, hyponyms
# Section 2 Synsets, hypernyms, hyponyms
import nltk
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print "dog hypernyms=", dog.hypernyms()
### dog hypernyms= [Synset('domestic_animal.n.01'), Synset('canine.n.02')]
print "dog hyponyms=", dog.hyponyms()
print "dog holonyms=", dog.member_holonyms()
print "dog root_hypernyms=", dog.root_hypernyms()
good = wn.synset('good.a.01')
### print "good.antonyms()=", good.antonyms()
print "good.lemmas[0].antonyms()=", good.lemmas[0].antonyms()
– 26 – CSCE 771 Spring 2013
wn03-Lemmas.py
### Section 3 Lemmas
eat = wn.lemma('eat.v.03.eat')
print eat
print eat.key
print eat.count()
print wn.lemma_from_key(eat.key)
print wn.lemma_from_key(eat.key).synset
print wn.lemma_from_key('feebleminded%5:00:00:retarded:00')
for lemma in wn.synset('eat.v.03').lemmas:
    print lemma, lemma.count()
for lemma in wn.lemmas('eat', 'v'):
    print lemma, lemma.count()
vocal = wn.lemma('vocal.a.01.vocal')
print vocal.derivationally_related_forms()
# [Lemma('vocalize.v.02.vocalize')]
print vocal.pertainyms()
# [Lemma('voice.n.02.voice')]
print vocal.antonyms()
– 27 – CSCE 771 Spring 2013
wn04-VerbFrames.py
# Section 4 Verb Frames
print wn.synset('think.v.01').frame_ids
for lemma in wn.synset('think.v.01').lemmas:
    print lemma, lemma.frame_ids
    print lemma.frame_strings
print wn.synset('stretch.v.02').frame_ids
for lemma in wn.synset('stretch.v.02').lemmas:
    print lemma, lemma.frame_ids
    print lemma.frame_strings
– 28 – CSCE 771 Spring 2013
wn05-Similarity.py
### Section 5 Similarity
import nltk
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print dog.path_similarity(cat)
print dog.lch_similarity(cat)
print dog.wup_similarity(cat)

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
– 29 – CSCE 771 Spring 2013
wn05-Similarity.py continued
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)
print dog.res_similarity(cat, brown_ic)
print dog.res_similarity(cat, genesis_ic)
print dog.jcn_similarity(cat, brown_ic)
print dog.jcn_similarity(cat, genesis_ic)
print dog.lin_similarity(cat, semcor_ic)
– 30 – CSCE 771 Spring 2013
wn06-AccessToAllSynsets.py
### Section 6 access to all synsets
import nltk
from nltk.corpus import wordnet as wn

for synset in list(wn.all_synsets('n'))[:10]:
    print synset

wn.synsets('dog')
wn.synsets('dog', pos='v')

from itertools import islice
for synset in islice(wn.all_synsets('n'), 5):
    print synset, synset.hypernyms()
– 31 – CSCE 771 Spring 2013
wn07-Morphy.py
# Wordnet in NLTK
# http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
import nltk
from nltk.corpus import wordnet as wn

### Section 7 Morphy
print wn.morphy('denied', wn.NOUN)
print wn.synsets('denied', wn.NOUN)
print wn.synsets('denied', wn.VERB)
– 32 – CSCE 771 Spring 2013
8 Regression Tests
Bug 85: morphy returns the base form of a word, if its input is given as a base form for a POS for which that word is not defined:

>>> wn.synsets('book', wn.NOUN)
[Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')]
>>> wn.synsets('book', wn.ADJ)
[]
>>> wn.morphy('book', wn.NOUN)
'book'
>>> wn.morphy('book', wn.ADJ)
– 33 – CSCE 771 Spring 2013
nltk.corpus.reader.wordnet.ic(self, corpus, weight_senses_equally=False, smoothing=1.0)
Creates an information content lookup dictionary from a corpus.
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#WordNetCorpusReader.ic

def demo():
    import nltk
    print('loading wordnet')
    wn = WordNetCorpusReader(nltk.data.find('corpora/wordnet'))
    print('done loading')
    S = wn.synset
    L = wn.lemma
– 34 – CSCE 771 Spring 2013
root_hypernyms
def root_hypernyms(self):
    """Get the topmost hypernyms of this synset in WordNet."""
    result = []
    seen = set()
    todo = [self]
    while todo:
        next_synset = todo.pop()
        if next_synset not in seen:
            seen.add(next_synset)
            next_hypernyms = next_synset.hypernyms() + …
    return result