
Slide 1

Lecture 22: Word Similarity
Topics: word similarity; thesaurus-based word similarity; introduction to distributional word similarity.
Readings: NLTK book Chapter 2 (WordNet); text Chapter 20.
April 8, 2013. CSCE 771 Natural Language Processing.

Slide 2
Overview
Last time (programming): features in NLTK; NL queries to SQL; NLTK support for interpretations and models; propositional and predicate logic support; Prover9.
Today: last lecture's slides 25-29; features in NLTK; computational lexical semantics.
Readings: text Chapters 19 and 20; NLTK book Chapter 10.
Next time: Computational Lexical Semantics II.

Slide 3
Figure 20.1: possible sense tags for "bass".
Chapter 20: word sense disambiguation (WSD); machine translation; supervised vs. unsupervised learning; semantic concordance (a corpus with words tagged with sense tags).

Slide 4
Feature extraction for WSD
Feature vectors:
Collocation: [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_i, POS_i, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]
Bag-of-words: an unordered set of neighboring words. Represent the set of most frequent content words with a membership vector, e.g. [0,0,1,0,0,0,1] for the set containing the 3rd and 7th most frequent content words.
Window of nearby words/features.

Slide 5
Naïve Bayes classifier
w: word vector; s: sense tag vector; f: feature vector [w_i, POS_i] for i = 1..n.
Choose ŝ = argmax_s P(s | f).
Approximate by frequency counts. But how practical is that?

Slide 6
Looking for a practical formula
By Bayes' rule, ŝ = argmax_s P(f | s) P(s) / P(f) = argmax_s P(f | s) P(s). Still not practical: full feature vectors are too sparse to estimate P(f | s) directly.

Slide 7
Naïve = assume independence
P(f | s) ≈ Π_j P(f_j | s), so ŝ = argmax_s P(s) Π_j P(f_j | s). Now practical, but is the independence assumption realistic?

Slide 8
Training = counting frequencies
Maximum likelihood estimator (20.8): P(f_j | s) = count(f_j, s) / count(s).

Slide 9
Decision list classifiers
Naïve Bayes decisions are hard for humans to examine and understand. A decision list classifier is like a case statement: a sequence of (test, returned-sense-tag) pairs.

Slide 10
Figure 20.2: decision list classifier rules.

Slide 11
WSD evaluation, baselines, ceilings
Extrinsic evaluation: evaluating WSD embedded in end-to-end applications (in vivo).
Intrinsic evaluation: evaluating WSD by itself (in vitro); metric: sense accuracy.
Corpora: SemCor, SENSEVAL, SemEval.
Baseline: most frequent sense (WordNet sense 1).
Ceiling: gold standard built by human experts with discussion and agreement.

Slide 12
Similarity of words or senses
Generally we say "words" but give the similarity of word senses.
Similarity vs. relatedness.
Similarity of words; similarity of phrases/sentences (not usually done).

Slide 13
Figure 20.3: Simplified Lesk algorithm (gloss/sentence overlap).

Slide 14
Simplified Lesk example: "The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities."
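A minimal sketch of the Simplified Lesk idea from Figure 20.3, using NLTK's WordNet as the dictionary; the whitespace tokenization, lack of stopword filtering, and tie-breaking here are simplifications for illustration, not the textbook's exact algorithm.

    # Sketch of Simplified Lesk: pick the sense whose gloss (plus example
    # sentences) overlaps most with the context sentence; ties keep the
    # first, i.e. most frequent, sense.
    from nltk.corpus import wordnet as wn

    def simplified_lesk(word, sentence):
        context = set(sentence.lower().split()) - {word}
        best_sense, best_overlap = None, -1
        for sense in wn.synsets(word):
            signature = set(sense.definition().lower().split())
            for example in sense.examples():
                signature |= set(example.lower().split())
            overlap = len(signature & context)
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    print(simplified_lesk("bank",
        "The bank can guarantee deposits will eventually cover future "
        "tuition costs because it invests in adjustable rate mortgage securities"))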
Slide 15
Corpus Lesk
Using equal weights on overlapping words just does not seem right. Instead, weight overlap words by inverse document frequency:
idf_i = log(N_docs / n_docs containing w_i)

Slide 16
SENSEVAL competitions: http://www.senseval.org/ (check the Senseval-3 website).

Slide 17
SemEval-2: Evaluation Exercises on Semantic Evaluation, an ACL SIGLEX event.

Slide 18
SemEval-2 tasks (task name, area):
#1 Coreference Resolution in Multiple Languages (coreference)
#2 Cross-Lingual Lexical Substitution (cross-lingual, lexical substitution)
#3 Cross-Lingual Word Sense Disambiguation (cross-lingual, word senses)
#4 VP Ellipsis: Detection and Resolution (ellipsis)
#5 Automatic Keyphrase Extraction from Scientific Articles
#6 Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts
#7 Argument Selection and Coercion (metonymy)
#8 Multi-Way Classification of Semantic Relations Between Pairs of Nominals
#9 Noun Compound Interpretation Using Paraphrasing Verbs (noun compounds)
#10 Linking Events and their Participants in Discourse (semantic role labeling, information extraction)
#11 Event Detection in Chinese News Sentences (semantic role labeling, word senses)
#12 Parser Training and Evaluation using Textual Entailment
#13 TempEval-2 (time expressions)
#14 Word Sense Induction
#15 Infrequent Sense Identification for Mandarin Text-to-Speech Systems
#16 Japanese WSD (word senses)
#17 All-words Word Sense Disambiguation on a Specific Domain (WSD-domain)
#18 Disambiguating Sentiment Ambiguous Adjectives (word senses, sentiment)

Slide 19
20.4.2 Selectional restrictions and preferences
The verb "eat" requires a theme (object) with the feature Food+. Katz and Fodor (1963) used this idea to rule out senses that were not consistent.
WSD of "dish":
(20.12) "In our house, everybody has a career and none of them includes washing dishes," he says.
(20.13) In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including ...
The verbs select different features: wash requires Washable+; stir-fry requires Edible+. A toy version of this filtering is sketched below.
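A toy sketch of this Katz & Fodor-style filtering; the sense inventory and feature sets below are invented for illustration, not taken from WordNet or the textbook.

    # Toy Katz & Fodor-style sense filtering: keep only the senses of
    # "dish" whose features satisfy the verb's selectional restriction.
    # The senses and features here are invented for illustration.
    DISH_SENSES = {
        "dish (artifact)": {"Washable+"},   # crockery, as in (20.12)
        "dish (food)":     {"Edible+"},     # a prepared meal, as in (20.13)
    }
    VERB_RESTRICTIONS = {
        "wash":     {"Washable+"},
        "stir-fry": {"Edible+"},
    }

    def compatible_senses(verb):
        required = VERB_RESTRICTIONS[verb]
        return [sense for sense, feats in DISH_SENSES.items()
                if required <= feats]

    print(compatible_senses("wash"))      # ['dish (artifact)']
    print(compatible_senses("stir-fry"))  # ['dish (food)']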
Slide 20
Resnik's model of selectional association
How much does a predicate tell you about the semantic class of its arguments? "eat" tells you a great deal; "was", "is", "to be" tell you very little.
The selectional preference strength of a verb is indicated by two distributions:
1. P(c): how likely the direct object is to be in class c.
2. P(c | v): the distribution of expected semantic classes for the particular verb v.
The greater the difference between these distributions, the more information the verb provides.

Slide 21
Relative entropy: Kullback-Leibler divergence
Given two distributions P and Q:
D(P || Q) = Σ_x P(x) log (P(x) / Q(x))   (eq. 20.16)
Selectional preference strength: S_R(v) = D(P(c | v) || P(c)) = Σ_c P(c | v) log (P(c | v) / P(c))

Slide 22
Resnik's model of selectional association (formulas).

Slide 23
High and low selectional associations (Resnik 1996).
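A small sketch of S_R(v) as a KL divergence; the semantic-class distributions are made up for illustration, showing that a choosy verb like "eat" scores high while a weakly constraining verb like "be" scores near zero.

    # Sketch of selectional preference strength S_R(v) = D(P(c|v) || P(c)).
    # The class distributions below are invented for illustration.
    from math import log2

    def kl_divergence(p, q):
        # D(P || Q) = sum over x with P(x) > 0 of P(x) * log2(P(x) / Q(x))
        return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

    p_c = {"food": 0.25, "person": 0.25, "artifact": 0.25, "event": 0.25}
    p_c_given_eat = {"food": 0.85, "person": 0.05, "artifact": 0.05, "event": 0.05}
    p_c_given_be = {"food": 0.25, "person": 0.30, "artifact": 0.20, "event": 0.25}

    print("S_R(eat) =", kl_divergence(p_c_given_eat, p_c))  # large: eat is choosy
    print("S_R(be)  =", kl_divergence(p_c_given_be, p_c))   # near zero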
Slide 24
20.5 Minimally supervised WSD: bootstrapping
Supervised and dictionary methods require large hand-built resources. Bootstrapping (semi-supervised or minimally supervised learning) addresses the no-data problem: start with a seed set and grow it.

Slide 25
Yarowsky algorithm: preliminaries
Idea of bootstrapping: create a larger training set from a small set of seeds.
Heuristics (for the senses of "bass"):
1. One sense per collocation: within a sentence, both senses of "bass" are not used.
2. One sense per discourse: Yarowsky showed that across 37,232 examples of "bass" occurring in a discourse, there was only one sense per discourse.

Slide 26
Yarowsky algorithm
Goal: learn a word-sense classifier for a word.
Input: Λ0, a small seed set of labeled instances of each sense.
1. Train a classifier on the seed set Λ0.
2. Label the unlabeled corpus V0 with the classifier.
3. Select the examples Δ in V that the classifier is most confident about.
4. Λ1 = Λ0 + Δ.
5. Repeat.

Slide 27
Figure 20.4: two senses of "plant": plant 1 (manufacturing plant) vs. plant 2 (flora, plant life).

Slide 28
2009 survey of WSD by Navigli: iroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf

Slide 29
Figure 20.5: sample "bass" sentences from the WSJ (Wall Street Journal).

Slide 30
Word similarity: thesaurus-based methods
Figure 20.6: path distances in a hierarchy (WordNet, pruned, of course).

Slide 31
Figure 20.6: path-based similarity
sim_path(c1, c2) = 1 / pathlen(c1, c2), where pathlen is measured in nodes (edge count + 1).

Slide 32
WordNet hierarchy: WordNet examples from the NLTK book.

    from nltk.corpus import wordnet as wn

    right = wn.synset('right_whale.n.01')
    orca = wn.synset('orca.n.01')
    minke = wn.synset('minke_whale.n.01')
    tortoise = wn.synset('tortoise.n.01')
    novel = wn.synset('novel.n.01')

    print("LCS(right, minke) =", right.lowest_common_hypernyms(minke))
    print("LCS(right, orca) =", right.lowest_common_hypernyms(orca))
    print("LCS(right, tortoise) =", right.lowest_common_hypernyms(tortoise))
    print("LCS(right, novel) =", right.lowest_common_hypernyms(novel))

Slide 33
Path similarity:

    print("Path similarities")
    print(right.path_similarity(minke))     # 0.25
    print(right.path_similarity(orca))      # 0.166666666667
    print(right.path_similarity(tortoise))  # 0.0769230769231
    print(right.path_similarity(novel))     # 0.0434782608696

Slide 34
WordNet in NLTK:
http://nltk.org/_modules/nltk/corpus/reader/wordnet.html
http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html (partially in Chapter 02 of the NLTK book, but a different version)
http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py (code for similarity; runs for a while, lots of results)

Slide 35
From https://groups.google.com/forum (3/4/10):
"Hi, I was wondering if it is possible for me to use NLTK + WordNet to group (noun) words together via similar meanings? Assuming I have 2000 words or topics, is it possible for me to group them together according to similar meanings using NLTK, so that at the end of the day I would have different groups of words that are similar in meaning? Can that be done in NLTK, and possibly be able to detect salient patterns emerging (trends in topics, etc.)? Is there a further need for a word classifier based on the CMU BOW toolkit to classify words into categories, or would the above grouping be good enough? Is there a need to classify words further? How would one classify words in NLTK effectively? Really hope you can enlighten me. FM"

Slide 36
Response from Steven Bird (3/7/10):
> Assuming I have 2000 words or topics. Is it possible for me to group
> them together according to similar meanings using NLTK?
"You could compute WordNet similarity (pairwise), so that each word/topic is represented as a vector of distances, which could then be discretized, so each vector would have a form like this: [0,2,3,1,0,0,2,1,3,...]. These vectors could then be clustered using one of the methods in the NLTK cluster package."
> So that at the end of the day I would have different groups of words
> that are similar in meaning? Can that be done in NLTK? and possibly be
> able to detect salient patterns emerging? (trend in topics etc...)
"This suggests a temporal dimension, which might mean recomputing the clusters as more words or topics come in. It might help to read the NLTK book sections on WordNet and on text classification, and also some of the other cited material." -Steven Bird
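A sketch of the approach Bird describes, assuming a small illustrative word list and a fixed number of clusters: represent each word by its vector of pairwise path similarities, then cluster the vectors with NLTK's k-means.

    # Sketch: each word becomes a vector of pairwise WordNet path
    # similarities; the vectors are then clustered with k-means.
    # The word list and cluster count are arbitrary choices.
    from nltk.corpus import wordnet as wn
    from nltk.cluster import KMeansClusterer, euclidean_distance
    import numpy

    words = ["dog", "cat", "horse", "car", "truck", "bicycle"]
    synsets = [wn.synsets(w, pos=wn.NOUN)[0] for w in words]  # first noun sense

    # vectors[i][j] = path similarity between word i and word j
    vectors = [
        numpy.array([s1.path_similarity(s2) or 0.0 for s2 in synsets])
        for s1 in synsets
    ]

    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
    labels = clusterer.cluster(vectors, assign_clusters=True)
    for word, label in zip(words, labels):
        print(word, "-> cluster", label)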
Slide 37
More general? (Stack Overflow)
The original slide code was garbled and left the second list unfinished; reconstructed by symmetry with the first:

    from nltk.corpus import wordnet as wn

    waiter = wn.synset('waiter.n.01')
    employee = wn.synset('employee.n.01')

    all_hyponyms_of_waiter = list(set(
        w.replace("_", " ")
        for s in waiter.closure(lambda s: s.hyponyms())
        for w in s.lemma_names()))
    all_hyponyms_of_employee = list(set(
        w.replace("_", " ")
        for s in employee.closure(lambda s: s.hyponyms())
        for w in s.lemma_names()))

    if 'waiter' in all_hyponyms_of_employee:
        print('employee more general than waiter')
    elif 'employee' in all_hyponyms_of_waiter:
        print('waiter more general than employee')

http://stackoverflow.com/questions/...-semantic-hierarchies-relations-in--nltk

Slide 38
From help(wn):

    res_similarity(self, synset1, synset2, ic, verbose=False)
        Resnik Similarity:
        Return a score denoting how similar two word senses are, based on the
        Information Content (IC) of the Least Common Subsumer (most specific
        ancestor node).

http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py

Slide 39
Similarity based on a hierarchy (= ontology).

Slide 40
Information-content word similarity.

Slide 41
Resnik similarity / WordNet
sim_resnik(c1, c2) = -log P(LCS(c1, c2))
In NLTK: res_similarity(synset1, synset2, ic); see the help text on Slide 38.

Slide 42
Figure 20.7: WordNet with Lin P(c) values. (Change for Resnik!)

Slide 43
Lin variation (1998)
Similarity is the ratio of what A and B share (commonality) to what it takes to describe them (difference):
Commonality: IC(common(A, B)). Difference: IC(description(A, B)) - IC(common(A, B)).
sim_Lin(A, B) = IC(common(A, B)) / IC(description(A, B))
For WordNet concepts: sim_Lin(c1, c2) = 2 log P(LCS(c1, c2)) / (log P(c1) + log P(c2)).

Slide 44
Figure 20.7: WordNet with Lin P(c) values.
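A brief sketch of the IC-based measures in NLTK; it loads the Brown-corpus information-content file that ships with NLTK's data packages ('ic-brown.dat') and compares an arbitrary word pair.

    # Sketch: Resnik and Lin similarity need an information-content (IC)
    # distribution; NLTK ships IC files computed from standard corpora.
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')   # IC counts from the Brown corpus

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    print("Resnik:", dog.res_similarity(cat, brown_ic))  # IC of the LCS
    print("Lin:   ", dog.lin_similarity(cat, brown_ic))  # normalized to [0, 1]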
Slide 45
Extended Lesk
Based on: (1) glosses; (2) glosses of hypernyms and hyponyms.
Example:
drawing paper: paper that is specially prepared for use in drafting
decal: the art of transferring designs from specially prepared paper to a wood, glass or metal surface
Lesk score = sum of the squares of the lengths of the common phrases. Here the overlaps are "paper" (length 1) and "specially prepared" (length 2), so the score is 1^2 + 2^2 = 5.

Slide 46
Figure 20.8: summary of thesaurus similarity measures.

Slide 47
WordNet similarity functions: path_similarity(), lch_similarity(), wup_similarity(), res_similarity(), jcn_similarity(), lin_similarity().

Slide 48
Problems with thesaurus-based methods (slides: Distributional Word Similarity, D. Jurafsky)
We don't always have a thesaurus, and even when we do there are recall problems: missing words and missing phrases. Thesauri also work less well for verbs and adjectives, which have less hyponymy structure.

Slide 49
Distributional models of meaning
Vector-space models of meaning offer higher recall than hand-built thesauri, probably with less precision.

Slide 50
Word similarity: distributional methods
(20.31) The tezguino example:
A bottle of tezguino is on the table.
Everybody likes tezguino.
Tezguino makes you drunk.
We make tezguino out of corn.
What do you know about tezguino?

Slide 51
Term-document matrix
Take a collection of documents and identify a collection of important, discriminatory terms (words).
Matrix: terms x documents; entry tf_{w,d} = frequency of term w in document d.
Each document is then a vector in Z^|V| (Z = the integers; N, the natural numbers, is more accurate but perhaps misleading).

Slide 52
Example term-document matrix, subset of terms = {battle, soldier, fool, clown}:

             As You Like It  Twelfth Night  Julius Caesar  Henry V
    battle         1               1              8           15
    soldier        2               2             12           36
    fool          37              58              1            5
    clown          6             117              0            0

Slide 53
Figure 20.9: term-in-context matrix for word similarity (window of 20 words: 10 before and 10 after, from the Brown corpus).

Slide 54
Pointwise mutual information
Use a tf-idf (inverse document frequency) rating instead of raw counts (the idf intuition again).
Pointwise mutual information (PMI): do events x and y co-occur more than if they were independent?
PMI(x, y) = log2 ( P(x, y) / (P(x) P(y)) )
PMI between words; positive PMI between two words (PPMI) clips negative values to zero.

Slide 55
Computing PPMI
Matrix with W rows (words) and C columns (contexts); f_ij is the frequency of w_i in context c_j:
p_ij = f_ij / Σ_i Σ_j f_ij,  p_i* = Σ_j p_ij,  p_*j = Σ_i p_ij
PMI_ij = log2 (p_ij / (p_i* p_*j)),  PPMI_ij = max(0, PMI_ij)

Slide 56
Example: computing PPMI (see the sketch below).
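A small numpy sketch of the PPMI computation on Slide 55; the word-context counts are invented for illustration.

    # Sketch of PPMI: ppmi[i][j] = max(0, log2(p_ij / (p_i* * p_*j))).
    # The toy word-context counts below are invented for illustration.
    import numpy as np

    # rows = words, columns = contexts
    f = np.array([[0, 0, 2],
                  [1, 0, 1],
                  [3, 1, 0]], dtype=float)

    p = f / f.sum()                           # joint probabilities p_ij
    p_word = p.sum(axis=1, keepdims=True)     # row marginals p_i*
    p_context = p.sum(axis=0, keepdims=True)  # column marginals p_*j

    with np.errstate(divide="ignore"):        # log2(0) -> -inf, clipped below
        pmi = np.log2(p / (p_word * p_context))
    ppmi = np.maximum(pmi, 0)
    print(np.round(ppmi, 2))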
Slides 57-63
Figures 20.10 through 20.16.

Slide 64
http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf (how to do this in NLTK)
NLTK 3.0a1 released, February 2013. This version adds support for NLTK's graphical user interfaces: http://nltk.org/nltk3-alpha/
Question: which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words? I want to use a function for word clustering and the Yarowsky algorithm to find similar collocations in a large text.
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Linguistics
http://en.wikipedia.org/wiki/Portal:Linguistics
http://en.wikipedia.org/wiki/Yarowsky_algorithm
http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
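Tying this back to Slide 26, here is a compact sketch of the Yarowsky bootstrapping loop; the confidence scoring (a smoothed log-probability margin), the threshold, and the data representation are all simplifications for illustration, not Yarowsky's exact decision-list procedure.

    # Sketch of the Yarowsky loop (Slide 26): train on seeds, label the
    # corpus, keep only confident labels (delta), grow the seed set, repeat.
    from collections import Counter, defaultdict
    from math import log

    def yarowsky(seed, unlabeled, rounds=5, threshold=2.0):
        # seed: list of (context_words, sense) pairs covering >= 2 senses;
        # unlabeled: list of context_words (each a list of strings).
        labeled = list(seed)
        for _ in range(rounds):
            # "Train": per-sense collocation counts with add-one smoothing.
            counts = defaultdict(Counter)
            for context, sense in labeled:
                counts[sense].update(context)
            senses = list(counts)

            def score(context, sense):
                total = sum(counts[sense].values()) + 1
                return sum(log((counts[sense][w] + 1) / total) for w in context)

            # Label the corpus; keep only confidently labeled examples.
            delta, rest = [], []
            for context in unlabeled:
                ranked = sorted(senses, key=lambda s: score(context, s),
                                reverse=True)
                margin = score(context, ranked[0]) - score(context, ranked[1])
                (delta if margin >= threshold else rest).append(
                    (context, ranked[0]))
            labeled += delta                  # Lambda_1 = Lambda_0 + delta
            unlabeled = [context for context, _ in rest]
            if not delta:                     # nothing confident: stop early
                break
        return labeled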