Word Sense Determination from Wikipedia Data Using Neural Networks

Advisor: Dr. Chris Pollett
Committee Members: Dr. Jon Pearce, Dr. Suneuy Kim
By Qiao Liu
Agenda
• Introduction
• Background
• Model Architecture
• Data Sets and Data Preprocessing
• Implementation
• Experiments and Discussions
• Conclusion and Future Work
Introduction
• Word sense disambiguation is the task of identifying which sense of an ambiguous word is used in a sentence.

in 1890, he became custodian of the Milwaukee public museum where he collected plant specimens for their greenhouse

… send collected fluid to a municipal sewage treatment plant or a commercial wastewater treatment facility

• Word sense disambiguation is useful in natural language processing tasks, such as speech synthesis, question answering, and machine translation.
Introduction
[Diagram: Word Sense Disambiguation → lexical sample task / all-words task; subtasks: sense discrimination, sense labeling; project purpose]

• Two variants of the word sense disambiguation task: the lexical sample task and the all-words task
• Two subtasks: sense discrimination and sense labeling
Background
Existing Work
Background
Approach 1: Dictionary-based

Given a target word t to be disambiguated in context c:
1. Retrieve all the sense definitions for t from a dictionary.
2. Select the sense whose definition has the most overlap with the context c of t.

• This approach requires a hand-built, machine-readable semantic sense dictionary.
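The two steps above can be sketched as a simplified Lesk-style overlap count. The mini-dictionary below is a hypothetical toy example, not the resource used in the project:

```python
def lesk_overlap(context_words, sense_definitions):
    """Pick the sense whose definition shares the most words with the context."""
    best_sense, best_overlap = None, -1
    context = set(context_words)
    for sense, definition in sense_definitions.items():
        overlap = len(context & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical mini-dictionary for "plant"
senses = {
    "factory": "an industrial building where goods are made or treatment happens",
    "flora": "a living organism such as a tree or flower grown in soil",
}
context = "sewage treatment facility collected fluid".split()
print(lesk_overlap(context, senses))  # → factory
```

The overlap score here is a raw word count; real dictionary-based systems typically remove stop words and weight rarer terms first.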
Background
Approach 2: Supervised machine learning

1. Extract a set of features from the context of the target word.
2. Use the features to train classifiers that can label ambiguous words in new text.

• This approach requires costly, large hand-built resources, because each ambiguous word needs to be labeled in the training data.
• A semi-supervised approach was proposed in 1995 by Yarowsky. This approach does not rely on large hand-built data: it uses bootstrapping to grow a classifier from a small hand-labeled seed set.
Background
Approach 3: Unsupervised machine learning

Interpret the senses of the ambiguous word as clusters of similar contexts. Contexts and words are represented by high-dimensional, real-valued vectors built from co-occurrence counts.

• In our project, we use a modification of this approach:
  • Word embeddings are trained using Wikipedia pages.
  • Word vectors of contexts computed from these embeddings are then clustered.
  • Given a new word to disambiguate, we use its context and the word embedding to find a word vector corresponding to this context. Then we determine the cluster it belongs to.
• In related work, Schütze used a dataset taken from the New York Times News Service and did clustering, but with a different kind of word vector.
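The context-to-cluster step above can be sketched in a few lines, assuming a word-to-vector lookup table. The 2-dimensional toy vectors and centroid values are hypothetical (the real embeddings had 250 dimensions):

```python
import numpy as np

def context_vector(context_words, embeddings):
    """Average the embedding vectors of the context words (unknown words skipped)."""
    vecs = [embeddings[w] for w in context_words if w in embeddings]
    return np.mean(vecs, axis=0)

def nearest_cluster(vec, centroids):
    """Assign a context vector to the closest cluster centroid (Euclidean distance)."""
    dists = np.linalg.norm(centroids - vec, axis=1)
    return int(np.argmin(dists))

# Toy 2-d embeddings with made-up values
emb = {"sewage": np.array([1.0, 0.0]),
       "treatment": np.array([0.8, 0.2]),
       "greenhouse": np.array([0.0, 1.0])}
centroids = np.array([[0.9, 0.1],   # e.g. the "factory" sense cluster
                      [0.1, 0.9]])  # e.g. the "flora" sense cluster
v = context_vector(["sewage", "treatment"], emb)
print(nearest_cluster(v, centroids))  # → 0
```

Averaging the context-word vectors is one simple way to turn a context into a single vector; the future-work slide suggests an IDF-weighted sum as a refinement.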
Background
• Word embeddings

A word embedding is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions):

word → R^n
W("plant") = [0.3, -0.2, 0.7, …]
W("crane") = [0.5, 0.4, -0.6, …]
Model Architecture

• Many NLP tasks take the approach of first learning a good word representation on one task and then using that representation for other tasks. We used this approach for the word sense determination task.
Model Architecture

• Learn a good word representation on one task, and then use that representation for other tasks.
• We used the Skip-gram model as the neural network language model layer.
Model Architecture

Skip-gram Model Architecture
• The training objective was to learn word embeddings good at predicting the context words in a sentence.
• We trained the neural network by feeding it word pairs of target word and context word found in our training data set.
$$J'(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} \mid w_t; \theta)$$

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t; \theta)$$

$$p(w_o \mid w_t) = \frac{\exp(w_o^{\top} w_t)}{\sum_{j=1}^{V} \exp(w_j^{\top} w_t)}$$
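The softmax probability p(w_o | w_t) can be computed directly with NumPy. The small vectors below are hypothetical stand-ins for learned embedding rows:

```python
import numpy as np

def softmax_prob(target_vec, output_vecs):
    """p(w_o | w_t) for every candidate output word: softmax over dot products."""
    scores = output_vecs @ target_vec  # w_j^T w_t for each candidate j
    scores -= scores.max()             # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

w_t = np.array([0.3, -0.2, 0.7])       # hypothetical target word vector
W = np.array([[0.5, 0.4, -0.6],        # hypothetical output word vectors
              [0.1, 0.0, 0.9],
              [-0.3, 0.2, 0.1]])
p = softmax_prob(w_t, W)
print(p)  # a probability distribution over the 3 candidate words
```

In practice the denominator sums over the whole vocabulary, which is why the implementation slides mention negative sampling (NUM_SAMPLE) and the NCE loss instead of the full softmax.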
• k-means clustering

k-means is a simple unsupervised classification algorithm. The aim of the k-means algorithm is to divide m points in n dimensions into k clusters so that the within-cluster sum of squares is minimized.

The distributional hypothesis says that similar words appear in similar contexts [9, 10]. Thus, we can use k-means to divide all vectors of contexts into k clusters.
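The project used sklearn.cluster for this step; the sketch below is a minimal Lloyd-style k-means in plain NumPy, just to illustrate the assign-then-recenter loop that minimizes the within-cluster sum of squares:

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Divide m points in n dimensions into k clusters (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, _ = kmeans(pts, k=2)
print(labels[0] == labels[1], labels[2] == labels[3])  # → True True
```

sklearn's KMeans adds smarter initialization (k-means++) and multiple restarts, which matter on high-dimensional context vectors.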
Model Architecture

• Data source: https://dumps.wikimedia.org/enwiki/20170201/
The pages-articles.xml file of the Wikipedia data dump contains the current version of all article pages, templates, and other pages.
• Training data for the model: word pairs (target word, context word)
Data Sets and Data Preprocessing

Sentence: "natural language processing projects are fun"

Target word → training samples (window size = 2):
natural → (natural, language), (natural, processing)
language → (language, natural), (language, processing), (language, projects)
processing → (processing, natural), (processing, language), (processing, projects), (processing, are)
projects → (projects, language), (projects, processing), (projects, are), (projects, fun)
are → (are, processing), (are, projects), (are, fun)
fun → (fun, projects), (fun, are)
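The table above can be reproduced with a short pair generator. This is a sketch of full enumeration only; per the parameter table, the actual pipeline sampled NUM_SKIPS context words per target rather than taking all of them:

```python
def skipgram_pairs(words, window=2):
    """Generate (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((target, words[j]))
    return pairs

sentence = "natural language processing projects are fun".split()
pairs = skipgram_pairs(sentence, window=2)
print(pairs[:2])  # → [('natural', 'language'), ('natural', 'processing')]
```

Clipping the window at the sentence boundaries (the max/min above) is what gives the first and last words fewer pairs than the interior ones.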
Data Sets and Data Preprocessing

Steps to process data:
• Extracted 90M sentences
• Counted words, created a dictionary and a reversed dictionary
• Regenerated sentences
• Created 5B word pairs
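The dictionary / reversed-dictionary step can be sketched as follows. The toy corpus and the UNK convention for out-of-vocabulary words are illustrative assumptions; the real run processed 90M sentences and kept the VOC_SIZE most frequent words:

```python
from collections import Counter

def build_dictionaries(sentences, voc_size=50000):
    """Map the most frequent words to integer ids; everything else becomes UNK (id 0)."""
    counts = Counter(w for s in sentences for w in s.split())
    dictionary = {"UNK": 0}
    for word, _ in counts.most_common(voc_size - 1):
        dictionary[word] = len(dictionary)
    # The reversed dictionary maps ids back to words, for inspecting results
    reversed_dictionary = {i: w for w, i in dictionary.items()}
    return dictionary, reversed_dictionary

sents = ["the plant grows", "the plant closed"]
d, rd = build_dictionaries(sents, voc_size=4)
print(rd[0])  # → UNK
```

"Regenerated sentences" then amounts to replacing each word with its id (falling back to the UNK id), which is the integer form the training pipeline consumes.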
Implementation
The optimizer:
• Gradient descent finds the minimum of a function by taking steps proportional to the negative of the gradient. In each iteration of gradient descent, we need to compute the gradient over all training examples.
• Instead of computing the gradient over the whole training set, each iteration of stochastic gradient descent only estimates this gradient from a batch of randomly picked examples.

We used stochastic gradient descent to optimize the vector representations during training.
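The mini-batch update described above can be sketched generically in NumPy. This is illustrative only; the project itself used TensorFlow's optimizer with an NCE loss, and the toy objective here (mean squared distance to data points) is an assumption chosen for its simple gradient:

```python
import numpy as np

def sgd(grad_fn, theta, data, lr=0.1, batch_size=2, steps=200, seed=0):
    """Each step estimates the gradient from one randomly picked mini-batch."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        batch = data[rng.choice(len(data), batch_size, replace=False)]
        theta = theta - lr * grad_fn(theta, batch)  # step against the gradient
    return theta

# Toy objective: mean squared distance from theta to the data points;
# its gradient on a batch is 2 * (theta - batch.mean()).
data = np.array([1.0, 2.0, 3.0, 4.0])
grad = lambda theta, batch: 2 * (theta - batch.mean())
theta = sgd(grad, theta=0.0, data=data)
print(theta)  # fluctuates near the data mean, 2.5
```

Each step touches only batch_size examples, which is what makes one pass cheap compared with full-batch gradient descent on 5B word pairs.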
Implementation
The parameters:

VOC_SIZE: The vocabulary size.
SKIP_WINDOW: The window size of context words around the target word.
NUM_SKIPS: The number of context words randomly taken to generate word pairs.
EMBEDDING_SIZE: The number of parameters in the word embedding; the size of the word vector.
LR: The learning rate of gradient descent.
BATCH_SIZE: The size of each batch in stochastic gradient descent. Running one batch is one step.
NUM_STEPS: The number of training steps.
NUM_SAMPLE: The number of negative samples.
Implementation
Tools and packages:
• TensorFlow r1.4
• TensorBoard 0.1.6
• Python 2.7.10
• Wikipedia Extractor v2.55
• sklearn.cluster [15]
• numpy
Experiments and Discussions

The experimental results are compared with Schütze's unsupervised learning approach from 1998:
• Schütze used a dataset (435M) taken from the New York Times News Service. We used a dataset extracted from Wikipedia pages (12G).
• Schütze used co-occurrence counts to generate vectors, which had a large number of dimensions (1,000/2,000). We used the Skip-gram model to learn a distributed word representation with a dimension of 250.
• Schütze applied singular value decomposition because of the large number of vector dimensions. Taking advantage of a smaller number of dimensions, we did not need to perform matrix decomposition.
• We experimented with the Skip-gram model with different parameters and selected one word embedding for clustering.
• Skip-gram model parameters
Experiments and Discussions

Experiment with the skip-gram model
• Used "average loss" to estimate the loss over every 100K batches.
• Visualized some words' nearest words.
Experiments and Discussions

Experiment with classifying word senses
• Clustered the contexts of the occurrences of a given ambiguous word into two/three coherent groups.
• Manually assigned labels to the occurrences of ambiguous words in the test corpus, and compared them with machine-learned labels to calculate accuracy.
• Before word sense determination, we assigned all occurrences to the most frequent meaning, and used that fraction as the baseline.
accuracy = (number of instances with the correct machine-learned sense label) / (total number of test instances)
Experiments and Discussions

• The "Schütze's baseline" column gives the fraction of the most frequent sense in his datasets.
• The "Schütze's accuracy" column gives the results of his disambiguation experiments with local term frequency, where applicable.
• We got better accuracy in the experiments with "capital" and "plant".
• However, the model cannot determine the senses of the words "interest" and "sake", which have baselines over 85% in our datasets.
Experiments and Discussions

Discussions
• Our datasets (12G) are much larger than Schütze's datasets (435M). For example, the size of his training set for the word "capital" is 13,015, and ours is 179,793. The larger datasets might have helped to increase the accuracy for some words.
• We also observed that when the baseline is high (>= 85%), the model cannot determine the senses of the word. The performance of unsupervised learning relies on sufficient information from the training data. However, the model did not get trained with sufficient data carrying the less frequent meanings.
• The size of the training data and the distribution of the senses of the target word have a significant influence on the performance of the model.
Conclusion
• In this project, we utilized distributed word representations and the distributional hypothesis to build a modular model to classify the senses of ambiguous words.
• Our experiments showed our model performed well when each sense of an ambiguous word accounted for more than 20% of the occurrences in the training dataset.
Conclusion and Future Work

Future Work
• Optimize the classifier. One possible approach might be using a weighted sum of contexts, taking IDF into account.
• Extend and experiment with this approach on other models with different classifiers. A classifier that works well when occurrences are skewed to one class might improve the accuracy for words where a large portion of occurrences use the most frequent sense.
• Tokenize the corpus; we could reduce the time cost of training by reducing the vocabulary size.
References

• Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
• G. E. Hinton, J. L. McClelland, D. E. Rumelhart. Distributed representations. In: Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations, MIT Press, 1986.
• T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, 2007.
• David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
• H. Schwenk. Continuous space language models. Computer Speech and Language, vol. 21, 2007.
• T. Mikolov, A. Deoras, S. Kombrink, L. Burget, J. Černocký. Empirical evaluation and combination of advanced language modeling techniques. In: Proceedings of Interspeech, 2011.
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013a.
• James R. Curran and Marc Moens. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition, pages 59-66, 2002.
• Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proc. of SIGKDD-02, pages 613-619, New York, NY, USA. ACM, 2002.
• Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of SIGDOC, pages 24-26, 1986.
• Olah, Christopher. Deep Learning, NLP, and Representations. Retrieved from http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/, 2014.
• Hartigan, J. A. and Wong, M. A. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1): pages 100-108, 1979.
• Schütze, Hinrich. Dimensions of meaning. In Proceedings of Supercomputing '92, pages 787-796, 1992.
• Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR 12, pp. 2825-2830, 2011.
• Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307-361, 2012.
• Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y., Saporta G. (eds) Proceedings of COMPSTAT'2010. Physica-Verlag HD.
• TensorFlow Tutorial, tf.nn.nce_loss. Retrieved from https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss, 2017.
• McCormick, C. Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com, 2017, January 11.
• D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. Proc. 33rd Annual Meeting of the ACL, Cambridge, MA, USA, pp. 189-196, 1995.
• Schütze, Hinrich. Automatic word sense discrimination. Computational Linguistics, v. 24 n. 1, March 1998.
Questions
Thank You!
Appendix: Model Architecture

Skip-gram model architecture
• We trained the neural network by feeding it word pairs of target word and context word found in our training data set.