
Word Meaning and Similarity

Word Senses and Word Relations

Slides are adapted from Dan Jurafsky

Reminder: lemma and wordform

• A lemma or citation form
  • Same stem, part of speech, rough semantics
• A wordform
  • The “inflected” word as it appears in text

Wordform    Lemma
banks       bank
sung        sing
duermes     dormir

Lemmas have senses

• One lemma “bank” can have many meanings:
  • Sense 1: “…a bank can hold the investments in a custodial account…”
  • Sense 2: “…as agriculture burgeons on the east bank the river will shrink even more”
• Sense (or word sense)
  • A discrete representation of an aspect of a word’s meaning.
• The lemma bank here has two senses

Homonymy

Homonyms: words that share a form but have unrelated, distinct meanings:

• bank1: financial institution, bank2: sloping land
• bat1: club for hitting a ball, bat2: nocturnal flying mammal

1. Homographs (bank/bank, bat/bat)
2. Homophones:
   1. Write and right
   2. Piece and peace

Homonymy causes problems for NLP applications

• Information retrieval
  • “bat care”
• Machine Translation
  • bat: murciélago (animal) or bate (for baseball)
• Text-to-Speech
  • bass (stringed instrument) vs. bass (fish)

Polysemy

• 1. The bank was constructed in 1875 out of local red brick.
• 2. I withdrew the money from the bank
• Are those the same sense?
  • Sense 2: “A financial institution”
  • Sense 1: “The building belonging to a financial institution”
• A polysemous word has related meanings
  • Most non-rare words have multiple meanings

Metonymy or Systematic Polysemy: A systematic relationship between senses

• Lots of types of polysemy are systematic
  • School, university, hospital
  • All can mean the institution or the building.
• A systematic relationship:
  • Building ↔ Organization
• Other such kinds of systematic polysemy:
  • Author (Jane Austen wrote Emma) ↔ Works of Author (I love Jane Austen)
  • Tree (Plums have beautiful blossoms) ↔ Fruit (I ate a preserved plum)

How do we know when a word has more than one sense?

• The “zeugma” test: Two senses of serve?
  • Which flights serve breakfast?
  • Does Lufthansa serve Philadelphia?
  • ?Does Lufthansa serve breakfast and San Jose?
• Since this conjunction sounds weird,
  • we say that these are two different senses of “serve”

Synonyms

• Words that have the same meaning in some or all contexts.
  • filbert / hazelnut
  • couch / sofa
  • big / large
  • automobile / car
  • vomit / throw up
  • Water / H2O
• Two lexemes are synonyms
  • if they can be substituted for each other in all situations
  • If so they have the same propositional meaning

Synonyms

• But there are few (or no) examples of perfect synonymy.
  • Even if many aspects of meaning are identical
  • Substitution still may not preserve acceptability, because of differences in politeness, slang, register, genre, etc.
• Example:
  • Water / H2O
  • Big / large
  • Brave / courageous

Synonymy is a relation between senses rather than words

• Consider the words big and large
• Are they synonyms?
  • How big is that plane?
  • Would I be flying on a large or small plane?
• How about here:
  • Miss Nelson became a kind of big sister to Benjamin.
  • ?Miss Nelson became a kind of large sister to Benjamin.
• Why?
  • big has a sense that means being older, or grown up
  • large lacks this sense

Antonyms

• Senses that are opposites with respect to one feature of meaning
• Otherwise, they are very similar!

dark/light   short/long   fast/slow   rise/fall
hot/cold     up/down      in/out

• More formally: antonyms can
  • define a binary opposition or be at opposite ends of a scale
    • long/short, fast/slow
  • Be reversives:
    • rise/fall, up/down

Hyponymy and Hypernymy

• One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
  • car is a hyponym of vehicle
  • mango is a hyponym of fruit
• Conversely hypernym/superordinate (“hyper is super”)
  • vehicle is a hypernym of car
  • fruit is a hypernym of mango

Superordinate/hypernym   vehicle   fruit   furniture
Subordinate/hyponym      car       mango   chair

Hyponymy more formally

• Extensional:
  • The class denoted by the superordinate extensionally includes the class denoted by the hyponym
• Entailment:
  • A sense A is a hyponym of sense B if being an A entails being a B
• Hyponymy is usually transitive
  • (A hypo B and B hypo C entails A hypo C)
• Another name: the IS-A hierarchy
  • A IS-A B (or A ISA B)
  • B subsumes A
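
The transitivity of IS-A can be sketched with a small child-to-parent table. The chain used here (hill, natural elevation, geological formation, entity) is the one these slides use when counting words for information content; the dictionary layout and the name `is_hyponym_of` are assumptions made for illustration, and giving each sense a single hypernym is a simplification.

```python
# Toy IS-A table: each sense maps to its direct hypernym (None for the root).
# One parent per sense is a simplifying assumption for this sketch.
HYPERNYM = {
    "hill": "natural elevation",
    "natural elevation": "geological formation",
    "geological formation": "entity",
    "entity": None,
}

def is_hyponym_of(a, b):
    """Return True if sense a IS-A sense b via one or more hypernym links."""
    node = HYPERNYM.get(a)
    while node is not None:
        if node == b:
            return True
        node = HYPERNYM.get(node)
    return False

print(is_hyponym_of("hill", "natural elevation"))  # direct link
print(is_hyponym_of("hill", "entity"))             # follows by transitivity
print(is_hyponym_of("entity", "hill"))             # the relation is not symmetric
```

Walking upward from the hyponym is enough here because "B subsumes A" holds exactly when B appears somewhere on A's chain of hypernyms.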

Hyponyms and Instances

• WordNet has both classes and instances.
• An instance is an individual, a proper noun that is a unique entity
  • San Francisco is an instance of city
  • But city is a class
  • city is a hyponym of municipality...location...


Word Meaning and Similarity

WordNet and other Online Thesauri

Applications of Thesauri and Ontologies

• Information Extraction
• Information Retrieval
• Question Answering
• Bioinformatics and Medical Informatics
• Machine Translation

WordNet 3.0

• A hierarchically organized lexical database
• On-line thesaurus + aspects of a dictionary
  • Some other languages available or under development
  • (Arabic, Finnish, German, Portuguese…)

Category    Unique Strings
Noun        117,798
Verb        11,529
Adjective   22,479
Adverb      4,481

Senses of “bass” in WordNet

How is “sense” defined in WordNet?

• The synset (synonym set), the set of near-synonyms, instantiates a sense or concept, with a gloss
• Example: chump as a noun with the gloss:
  “a person who is gullible and easy to take advantage of”
• This sense of “chump” is shared by 9 words:
  chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2
• Each of these senses has this same gloss
  • (Not every sense; sense 2 of gull is the aquatic bird)
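
One way to picture a synset is as a record holding a gloss and the word senses that share it. The sketch below is only an illustration of that idea, not WordNet's actual storage format; the (lemma, sense number) pairs and the helper name `in_synset` are assumptions.

```python
# A synset groups word senses that share one gloss.
# Senses are written as (lemma, sense_number) pairs, e.g. ("gull", 1).
CHUMP_SYNSET = {
    "gloss": "a person who is gullible and easy to take advantage of",
    "senses": [
        ("chump", 1), ("fool", 2), ("gull", 1), ("mark", 9), ("patsy", 1),
        ("fall guy", 1), ("sucker", 1), ("soft touch", 1), ("mug", 2),
    ],
}

def in_synset(lemma, sense_number, synset):
    """True if this particular sense of the lemma belongs to the synset."""
    return (lemma, sense_number) in synset["senses"]

print(len(CHUMP_SYNSET["senses"]))         # number of senses sharing the gloss
print(in_synset("gull", 1, CHUMP_SYNSET))  # the "gullible person" sense
print(in_synset("gull", 2, CHUMP_SYNSET))  # the aquatic-bird sense lives elsewhere
```

Keying membership on the sense rather than the bare word is what lets gull1 belong to this synset while gull2 does not.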

WordNet Hypernym Hierarchy for “bass”

WordNet Noun Relations

WordNet 3.0

• Where it is:
  • http://wordnetweb.princeton.edu/perl/webwn
• Libraries
  • Python: WordNet from NLTK
    • http://www.nltk.org/Home
  • Java:
    • JWNL, extJWNL on sourceforge

MeSH: Medical Subject Headings thesaurus from the National Library of Medicine

• MeSH (Medical Subject Headings)
  • 177,000 entry terms that correspond to 26,142 biomedical “headings”
• Hemoglobins
  Entry Terms (the synset): Eryhem, Ferrous Hemoglobin, Hemoglobin
  Definition: The oxygen-carrying proteins of ERYTHROCYTES. They are found in all vertebrates and some invertebrates. The number of globin subunits in the hemoglobin quaternary structure differs between species. Structures range from monomeric to a variety of multimeric arrangements

The MeSH Hierarchy

Uses of the MeSH Ontology

• Provide synonyms (“entry terms”)
  • E.g., glucose and dextrose
• Provide hypernyms (from the hierarchy)
  • E.g., glucose ISA monosaccharide
• Indexing in MEDLINE/PubMED database
  • NLM’s bibliographic database:
    • 20 million journal articles
    • Each article hand-assigned 10-20 MeSH terms


Word Meaning and Similarity

Word Similarity: Thesaurus Methods

Word Similarity

• Synonymy: a binary relation
  • Two words are either synonymous or not
• Similarity (or distance): a looser metric
  • Two words are more similar if they share more features of meaning
• Similarity is properly a relation between senses
  • The word “bank” is not similar to the word “slope”
  • Bank1 is similar to fund3
  • Bank2 is similar to slope5
• But we’ll compute similarity over both words and senses

Why word similarity

• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering

Word similarity and word relatedness

• We often distinguish word similarity from word relatedness
  • Similar words: near-synonyms
  • Related words: can be related in any way
    • car, bicycle: similar
    • car, gasoline: related, not similar

Two classes of similarity algorithms

• Thesaurus-based algorithms
  • Are words “nearby” in hypernym hierarchy?
  • Do words have similar glosses (definitions)?
• Distributional algorithms
  • Do words have similar distributional contexts?

Path based similarity

• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
  • = have a short path between them
  • concepts have path 1 to themselves

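Refinements to path-based similarity

• pathlen(c1,c2) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2

• simpath(c1,c2) ranges from 0 to 1 (identity):

$$\mathrm{sim}_{\text{path}}(c_1,c_2) = \frac{1}{\mathrm{pathlen}(c_1,c_2)}$$

• Word similarity takes the best-matching pair of senses:

$$\mathrm{wordsim}(w_1,w_2) = \max_{c_1 \in \mathrm{senses}(w_1),\; c_2 \in \mathrm{senses}(w_2)} \mathrm{sim}_{\text{path}}(c_1,c_2)$$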

Example: path-based similarity

simpath(c1,c2) = 1/pathlen(c1,c2)

simpath(nickel, coin) = 1/2 = .5
simpath(fund, budget) = 1/2 = .5
simpath(nickel, currency) = 1/4 = .25
simpath(nickel, money) = 1/6 = .17
simpath(coinage, Richter scale) = 1/6 = .17
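
The path-based numbers above can be reproduced with a breadth-first search over a small hypernym graph. The figure that goes with this slide did not survive in the transcript, so the child-to-parent links below are an assumed hierarchy chosen only because it is consistent with the five values listed; the function and variable names are likewise illustrative.

```python
from collections import deque

# Assumed hypernym links (child -> parent), chosen to agree with the slide's numbers.
PARENT = {
    "nickel": "coin",
    "coin": "coinage",
    "coinage": "currency",
    "currency": "medium of exchange",
    "money": "medium of exchange",
    "fund": "money",
    "budget": "fund",
    "medium of exchange": "standard",
    "scale": "standard",
    "Richter scale": "scale",
}

def build_graph(parent):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, par in parent.items():
        graph.setdefault(child, set()).add(par)
        graph.setdefault(par, set()).add(child)
    return graph

GRAPH = build_graph(PARENT)

def pathlen(c1, c2):
    """1 + number of edges on the shortest path between c1 and c2."""
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, edges = queue.popleft()
        if node == c2:
            return 1 + edges
        for nxt in GRAPH[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, edges + 1))
    raise ValueError("no path between %r and %r" % (c1, c2))

def simpath(c1, c2):
    return 1.0 / pathlen(c1, c2)

def wordsim(senses1, senses2):
    """Best simpath over all pairs of senses of the two words."""
    return max(simpath(a, b) for a in senses1 for b in senses2)

for pair in [("nickel", "coin"), ("fund", "budget"), ("nickel", "currency"),
             ("nickel", "money"), ("coinage", "Richter scale")]:
    print(pair, round(simpath(*pair), 2))
```

`wordsim` takes the sense inventories of two words as plain collections of node names, for example `wordsim({"nickel"}, {"coin", "money"})`, and returns the similarity of the closest pair.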

Problem with basic path-based similarity

• Assumes each link represents a uniform distance
  • But nickel to money seems to us to be closer than nickel to standard
  • Nodes high in the hierarchy are very abstract
• We instead want a metric that
  • Represents the cost of each edge independently
  • Words connected only through abstract nodes are less similar

Information content similarity metrics

• Let’s define P(c) as:
  • The probability that a randomly selected word in a corpus is an instance of concept c
  • Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
  • for a given concept, each observed noun is either
    • a member of that concept with probability P(c)
    • not a member of that concept with probability 1-P(c)
• All words are members of the root node (Entity)
  • P(root)=1
• The lower a node in hierarchy, the lower its probability

Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI

Information content similarity

• Train by counting in a corpus
  • Each instance of hill counts toward frequency of natural elevation, geological formation, entity, etc
  • Let words(c) be the set of all words that are children of node c
    • words(“geo-formation”) = {hill, ridge, grotto, coast, cave, shore, natural elevation}
    • words(“natural elevation”) = {hill, ridge}

$$P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N}$$

[Figure: fragment of the hypernym hierarchy with nodes entity, geological-formation, natural elevation, shore, cave, hill, ridge, coast, grotto]
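
Counting toward P(c) can be sketched by letting every word occurrence add to its own node and to each node above it. In the sketch below all counts are hypothetical, the placement of coast under shore and grotto under cave is an assumption about the lost figure, and a node's own word is counted along with its descendants so that leaves get a nonzero probability.

```python
from collections import defaultdict

# Child -> parent links; the shore/coast and cave/grotto placement is assumed.
PARENT = {
    "geological-formation": "entity",
    "natural elevation": "geological-formation",
    "shore": "geological-formation",
    "cave": "geological-formation",
    "hill": "natural elevation",
    "ridge": "natural elevation",
    "coast": "shore",
    "grotto": "cave",
}

# Hypothetical corpus counts; "entity" stands in for every other noun token.
COUNTS = {
    "hill": 4, "ridge": 2, "natural elevation": 1,
    "coast": 3, "shore": 2, "cave": 2, "grotto": 1,
    "geological-formation": 1, "entity": 84,
}

def concept_probabilities(counts, parent):
    """P(c) = (count of words at or below c) / N, where N is the total count."""
    n = sum(counts.values())
    totals = defaultdict(int)
    for word, k in counts.items():
        node = word
        while node is not None:      # credit the word's node and every ancestor
            totals[node] += k
            node = parent.get(node)
    return {c: t / n for c, t in totals.items()}

P = concept_probabilities(COUNTS, PARENT)
for concept in ["entity", "geological-formation", "natural elevation", "hill"]:
    print(concept, P[concept])
```

With these made-up counts the root gets probability 1 and the probabilities shrink as you move down the hierarchy, which is the property the information content measures rely on.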

Information content similarity

• WordNet hierarchy augmented with probabilities P(c)

D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998

Information content: definitions

• Information content:
  IC(c) = -log P(c)
• Most informative subsumer (Lowest common subsumer)
  LCS(c1,c2) = The most informative (lowest) node in the hierarchy subsuming both c1 and c2

Using information content for similarity: the Resnik method

• The similarity between two words is related to their common information
• The more two words have in common, the more similar they are
• Resnik: measure common information as:
  • The information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes
  • simresnik(c1,c2) = -log P( LCS(c1,c2) )

Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
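
A minimal sketch of the Resnik measure: find the lowest common subsumer by walking up from each concept, then take the negative log of its probability. The three non-root probabilities are the ones quoted in the hill/coast worked example elsewhere in these slides; the tree shape (hill under natural elevation, coast under shore) and the choice of natural log are assumptions.

```python
import math

# Child -> parent links for the part of the hierarchy needed here (layout assumed).
PARENT = {
    "hill": "natural elevation",
    "natural elevation": "geological-formation",
    "coast": "shore",
    "shore": "geological-formation",
    "geological-formation": "entity",
}

# P(c) for the nodes used below; non-root values come from the slides' hill/coast example.
P = {
    "hill": 0.0000189,
    "coast": 0.0000216,
    "geological-formation": 0.00176,
    "entity": 1.0,
}

def ancestors(c):
    """c followed by its hypernyms up to the root, lowest first."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: the first node above c1 that also subsumes c2."""
    above_c2 = set(ancestors(c2))
    for node in ancestors(c1):
        if node in above_c2:
            return node
    raise ValueError("concepts share no subsumer")

def sim_resnik(c1, c2):
    return -math.log(P[lcs(c1, c2)])

print(lcs("hill", "coast"))
print(round(sim_resnik("hill", "coast"), 2))
```

In a tree the lowest shared ancestor is also the most informative one, since probabilities only grow toward the root; in a graph with multiple parents you would instead compare IC over all shared subsumers.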

Dekang Lin method

• Intuition: Similarity between A and B is not just what they have in common
• The more differences between A and B, the less similar they are:
  • Commonality: the more A and B have in common, the more similar they are
  • Difference: the more differences between A and B, the less similar
• Commonality: IC(common(A,B))
• Difference: IC(description(A,B)) - IC(common(A,B))

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML

Dekang Lin similarity theorem

• The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are

$$\mathrm{sim}_{\text{Lin}}(A,B) \propto \frac{IC(\mathrm{common}(A,B))}{IC(\mathrm{description}(A,B))}$$

• Lin (altering Resnik) defines IC(common(A,B)) as 2 x information of the LCS

$$\mathrm{sim}_{\text{Lin}}(c_1,c_2) = \frac{2 \log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$$

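Lin similarity function

$$\mathrm{sim}_{\text{Lin}}(c_1,c_2) = \frac{2 \log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$$

$$\mathrm{sim}_{\text{Lin}}(\text{hill},\text{coast}) = \frac{2 \log P(\text{geological-formation})}{\log P(\text{hill}) + \log P(\text{coast})} = \frac{2 \ln 0.00176}{\ln 0.0000189 + \ln 0.0000216} = .59$$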
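
The hill/coast arithmetic can be checked in a few lines. The function name `sim_lin` and its argument names are illustrative; the three probabilities are the ones given in the worked example.

```python
import math

def sim_lin(p_lcs, p_c1, p_c2):
    """Lin similarity from the probabilities of the LCS and of the two concepts."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))

# P(geological-formation), P(hill), P(coast) from the worked example.
value = sim_lin(p_lcs=0.00176, p_c1=0.0000189, p_c2=0.0000216)
print(round(value, 2))  # 0.59
```

Because the measure is a ratio of logarithms, the base of the log cancels, so natural log and log base 2 give the same result.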

The (extended) Lesk Algorithm

• A thesaurus-based measure that looks at glosses
• Two concepts are similar if their glosses contain similar words
  • Drawing paper: paper that is specially prepared for use in drafting
  • Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
• For each n-word phrase that’s in both glosses
  • Add a score of $n^2$
  • Paper and specially prepared for $1 + 2^2 = 5$
• Compute overlap also for other relations
  • glosses of hypernyms and hyponyms
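
The gloss-overlap score for a single pair of glosses can be sketched as follows: repeatedly take the longest phrase shared by both glosses, add the square of its length, and blank it out so it is not counted twice. This is a simplified reading of the overlap step only; whitespace tokenization, the absence of any stop-word filtering, and the function names are assumptions, and the sum over related glosses (hypernyms, hyponyms) is not shown.

```python
def longest_common_phrase(a, b):
    """Return (length, start_in_a, start_in_b) of the longest shared run of tokens."""
    best = (0, -1, -1)
    prev = [0] * (len(b) + 1)  # prev[j]: shared run length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # None marks tokens already consumed by an earlier match.
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

def overlap_score(gloss1, gloss2):
    """Sum of n^2 over shared n-word phrases, longest phrases first."""
    a = gloss1.lower().split()
    b = gloss2.lower().split()
    score = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            break
        score += n * n
        # Blank out the matched tokens so they cannot be matched again.
        a[i:i + n] = [None] * n
        b[j:j + n] = [None] * n
    return score

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(overlap_score(drawing_paper, decal))  # 5
```

Here "specially prepared" contributes 4 and "paper" contributes 1. Squaring the phrase length is what makes one two-word match worth more than two separate one-word matches.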

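Summary: thesaurus-based similarity

$$\mathrm{sim}_{\text{path}}(c_1,c_2) = \frac{1}{\mathrm{pathlen}(c_1,c_2)}$$

$$\mathrm{sim}_{\text{resnik}}(c_1,c_2) = -\log P(\mathrm{LCS}(c_1,c_2))$$

$$\mathrm{sim}_{\text{lin}}(c_1,c_2) = \frac{2 \log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$$

$$\mathrm{sim}_{\text{jiangconrath}}(c_1,c_2) = \frac{1}{2 \log P(\mathrm{LCS}(c_1,c_2)) - \left(\log P(c_1) + \log P(c_2)\right)}$$

$$\mathrm{sim}_{\text{eLesk}}(c_1,c_2) = \sum_{r,q \in \mathrm{RELS}} \mathrm{overlap}\big(\mathrm{gloss}(r(c_1)),\, \mathrm{gloss}(q(c_2))\big)$$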
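
As a rough check on the Jiang-Conrath measure, the sketch below writes it in terms of IC(c) = -log P(c), so that the denominator is the distance IC(c1) + IC(c2) - 2·IC(LCS), which is never negative. The probabilities are reused from the hill/coast example for the Lin measure; the function names and the use of natural log are assumptions.

```python
import math

def ic(p):
    """Information content of a concept with probability p."""
    return -math.log(p)

def sim_jiang_conrath(p_lcs, p_c1, p_c2):
    """Inverse of the Jiang-Conrath distance IC(c1) + IC(c2) - 2 * IC(LCS)."""
    distance = ic(p_c1) + ic(p_c2) - 2 * ic(p_lcs)
    return 1.0 / distance  # undefined when the distance is 0, e.g. for c1 == c2

# P(geological-formation), P(hill), P(coast) from the Lin worked example.
value = sim_jiang_conrath(p_lcs=0.00176, p_c1=0.0000189, p_c2=0.0000216)
print(round(value, 3))
```

Unlike the Lin ratio, this value depends on the base of the logarithm, so scores are only comparable when computed with the same base.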

Libraries for computing thesaurus-based similarity

• NLTK
  • http://nltk.github.com/api/nltk.corpus.reader.html?highlight=similarity-nltk.corpus.reader.WordNetCorpusReader.res_similarity
• WordNet::Similarity
  • http://wn-similarity.sourceforge.net/
  • Web-based interface:
    • http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi


Evaluating similarity

• Extrinsic (task-based, end-to-end) Evaluation:
  • Question Answering
  • Spell Checking
  • Essay grading
• Intrinsic Evaluation:
  • Correlation between algorithm and human word similarity ratings
    • Wordsim353: 353 noun pairs rated 0-10. sim(plane,car)=5.77
  • Taking TOEFL multiple-choice vocabulary tests
    • Levied is closest in meaning to: imposed, believed, requested, correlated
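
Intrinsic evaluation by correlation can be sketched with a rank correlation between system scores and human ratings. The slides do not say which correlation coefficient is meant; Spearman's rank correlation is used here as one common choice. The ratings below are hypothetical placeholders and are not taken from Wordsim353.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ratings for four word pairs (placeholders, not real data).
human_ratings = [9.0, 7.5, 5.0, 1.0]   # 0-10 scale, as in Wordsim353
system_scores = [0.8, 0.9, 0.4, 0.1]   # e.g. output of a similarity measure

rho = spearman(human_ratings, system_scores)
print(round(rho, 2))
```

A rank correlation only asks whether the system orders the pairs the way people do, so the system's scores do not need to be on the same 0-10 scale as the human ratings.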
