Word Sense Determination from Wikipedia Data Using Neural Networks

Advisor: Dr. Chris Pollett
Committee Members: Dr. Jon Pearce, Dr. Suneuy Kim
By Qiao Liu
Agenda
• Introduction
• Background
• Model Architecture
• Data Sets and Data Preprocessing
• Implementation
• Experiments and Discussions
• Conclusion and Future Work
Introduction
• Word sense disambiguation is the task of identifying which sense of an ambiguous word is used in a sentence.

in 1890, he became custodian of the Milwaukee public museum where he collected plant specimens for their greenhouse

… send collected fluid to a municipal sewage treatment plant or a commercial wastewater treatment facility

• Word sense disambiguation is useful in natural language processing tasks, such as speech synthesis, question answering, and machine translation.
Introduction
[Diagram: Word Sense Disambiguation → lexical sample task / all-words task; subtasks: sense discrimination, sense labeling; project purpose]

• Two variants of the word sense disambiguation task: the lexical sample task and the all-words task
• Two subtasks: sense discrimination and sense labeling
Background
Existing Work
Background
Approach 1: Dictionary-based

Given a target word t to be disambiguated in context c:
1. Retrieve all the sense definitions for t from a dictionary.
2. Select the sense whose definition has the most overlap with the context c of t.

• This approach requires a hand-built, machine-readable semantic sense dictionary.
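The two steps above can be sketched as a simplified Lesk-style overlap count. The mini-dictionary below is a hypothetical toy example, not the resource used in the project:

```python
def lesk_overlap(context_words, sense_definitions):
    """Pick the sense whose definition shares the most words with the context."""
    best_sense, best_overlap = None, -1
    context = set(context_words)
    for sense, definition in sense_definitions.items():
        overlap = len(context & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical mini-dictionary for "plant"
senses = {
    "factory": "an industrial building where goods are made or treatment happens",
    "flora": "a living organism such as a tree or flower grown in soil",
}
context = "sewage treatment facility collected fluid".split()
print(lesk_overlap(context, senses))  # → factory
```

The overlap score here is a raw word count; real dictionary-based systems typically remove stop words and weight rarer terms first.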
Background
Approach 2: Supervised machine learning

1. Extract a set of features from the context of the target word.
2. Use the features to train classifiers that can label ambiguous words in new text.

• This approach requires costly, large hand-built resources, because each ambiguous word needs to be labeled in the training data.
• A semi-supervised approach was proposed in 1995 by Yarowsky. This approach does not rely on large hand-built data: it uses bootstrapping to grow a classifier from a small hand-labeled seed set.
Background
Approach 3: Unsupervised machine learning

Interpret the senses of the ambiguous word as clusters of similar contexts. Contexts and words are represented by high-dimensional, real-valued vectors built from co-occurrence counts.

• In our project, we use a modification of this approach:
  • Word embeddings are trained using Wikipedia pages.
  • Word vectors of contexts computed from these embeddings are then clustered.
  • Given a new word to disambiguate, we use its context and the word embedding to find a word vector corresponding to this context. Then we determine the cluster it belongs to.
• In related work, Schütze used a dataset taken from the New York Times News Service and did clustering, but with a different kind of word vector.
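The context-to-cluster step above can be sketched in a few lines, assuming a word-to-vector lookup table. The 2-dimensional toy vectors and centroid values are hypothetical (the real embeddings had 250 dimensions):

```python
import numpy as np

def context_vector(context_words, embeddings):
    """Average the embedding vectors of the context words (unknown words skipped)."""
    vecs = [embeddings[w] for w in context_words if w in embeddings]
    return np.mean(vecs, axis=0)

def nearest_cluster(vec, centroids):
    """Assign a context vector to the closest cluster centroid (Euclidean distance)."""
    dists = np.linalg.norm(centroids - vec, axis=1)
    return int(np.argmin(dists))

# Toy 2-d embeddings with made-up values
emb = {"sewage": np.array([1.0, 0.0]),
       "treatment": np.array([0.8, 0.2]),
       "greenhouse": np.array([0.0, 1.0])}
centroids = np.array([[0.9, 0.1],   # e.g. the "factory" sense cluster
                      [0.1, 0.9]])  # e.g. the "flora" sense cluster
v = context_vector(["sewage", "treatment"], emb)
print(nearest_cluster(v, centroids))  # → 0
```

Averaging the context-word vectors is one simple way to turn a context into a single vector; the future-work slide suggests an IDF-weighted sum as a refinement.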
Background
• Word embeddings

A word embedding is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions):

word → R^n
W("plant") = [0.3, -0.2, 0.7, …]
W("crane") = [0.5, 0.4, -0.6, …]
Model Architecture

• Many NLP tasks take the approach of first learning a good word representation on one task and then using that representation for other tasks. We used this approach for the word sense determination task.
Model Architecture

• Learn a good word representation on one task, and then use that representation for other tasks.
• We used the Skip-gram model as the neural network language model layer.
Model Architecture

Skip-gram Model Architecture
• The training objective was to learn word embeddings good at predicting the context words in a sentence.
• We trained the neural network by feeding it word pairs of target word and context word found in our training data set.
$$J'(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} \mid w_t; \theta)$$

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t; \theta)$$

$$p(w_o \mid w_t) = \frac{\exp(w_o^{\top} w_t)}{\sum_{j=1}^{V} \exp(w_j^{\top} w_t)}$$
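The softmax probability p(w_o | w_t) can be computed directly with NumPy. The small vectors below are hypothetical stand-ins for learned embedding rows:

```python
import numpy as np

def softmax_prob(target_vec, output_vecs):
    """p(w_o | w_t) for every candidate output word: softmax over dot products."""
    scores = output_vecs @ target_vec  # w_j^T w_t for each candidate j
    scores -= scores.max()             # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

w_t = np.array([0.3, -0.2, 0.7])       # hypothetical target word vector
W = np.array([[0.5, 0.4, -0.6],        # hypothetical output word vectors
              [0.1, 0.0, 0.9],
              [-0.3, 0.2, 0.1]])
p = softmax_prob(w_t, W)
print(p)  # a probability distribution over the 3 candidate words
```

In practice the denominator sums over the whole vocabulary, which is why the implementation slides mention negative sampling (NUM_SAMPLE) and the NCE loss instead of the full softmax.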
• k-means clustering

k-means is a simple unsupervised classification algorithm. The aim of the k-means algorithm is to divide m points in n dimensions into k clusters so that the within-cluster sum of squares is minimized.

The distributional hypothesis says that similar words appear in similar contexts [9, 10]. Thus, we can use k-means to divide all vectors of contexts into k clusters.
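The project used sklearn.cluster for this step; the sketch below is a minimal Lloyd-style k-means in plain NumPy, just to illustrate the assign-then-recenter loop that minimizes the within-cluster sum of squares:

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Divide m points in n dimensions into k clusters (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, _ = kmeans(pts, k=2)
print(labels[0] == labels[1], labels[2] == labels[3])  # → True True
```

sklearn's KMeans adds smarter initialization (k-means++) and multiple restarts, which matter on high-dimensional context vectors.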
Model Architecture

• Data source: https://dumps.wikimedia.org/enwiki/20170201/
The pages-articles.xml file of the Wikipedia data dump contains the current version of all article pages, templates, and other pages.
• Training data for the model: word pairs (target word, context word)
Data Sets and Data Preprocessing

Sentence: "natural language processing projects are fun"

Target word → training samples (window size = 2):
natural → (natural, language), (natural, processing)
language → (language, natural), (language, processing), (language, projects)
processing → (processing, natural), (processing, language), (processing, projects), (processing, are)
projects → (projects, language), (projects, processing), (projects, are), (projects, fun)
are → (are, processing), (are, projects), (are, fun)
fun → (fun, projects), (fun, are)
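The table above can be reproduced with a short pair generator. This is a sketch of full enumeration only; per the parameter table, the actual pipeline sampled NUM_SKIPS context words per target rather than taking all of them:

```python
def skipgram_pairs(words, window=2):
    """Generate (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((target, words[j]))
    return pairs

sentence = "natural language processing projects are fun".split()
pairs = skipgram_pairs(sentence, window=2)
print(pairs[:2])  # → [('natural', 'language'), ('natural', 'processing')]
```

Clipping the window at the sentence boundaries (the max/min above) is what gives the first and last words fewer pairs than the interior ones.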
Data Sets and Data Preprocessing

Steps to process data:
• Extracted 90M sentences
• Counted words, created a dictionary and a reversed dictionary
• Regenerated sentences
• Created 5B word pairs
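The dictionary / reversed-dictionary step can be sketched as follows. The toy corpus and the UNK convention for out-of-vocabulary words are illustrative assumptions; the real run processed 90M sentences and kept the VOC_SIZE most frequent words:

```python
from collections import Counter

def build_dictionaries(sentences, voc_size=50000):
    """Map the most frequent words to integer ids; everything else becomes UNK (id 0)."""
    counts = Counter(w for s in sentences for w in s.split())
    dictionary = {"UNK": 0}
    for word, _ in counts.most_common(voc_size - 1):
        dictionary[word] = len(dictionary)
    # The reversed dictionary maps ids back to words, for inspecting results
    reversed_dictionary = {i: w for w, i in dictionary.items()}
    return dictionary, reversed_dictionary

sents = ["the plant grows", "the plant closed"]
d, rd = build_dictionaries(sents, voc_size=4)
print(rd[0])  # → UNK
```

"Regenerated sentences" then amounts to replacing each word with its id (falling back to the UNK id), which is the integer form the training pipeline consumes.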
Implementation
The optimizer:
• Gradient descent finds the minimum of a function by taking steps proportional to the negative of the gradient. In each iteration of gradient descent, we need to compute the gradient over all training examples.
• Instead of computing the gradient over the whole training set, each iteration of stochastic gradient descent only estimates this gradient from a batch of randomly picked examples.

We used stochastic gradient descent to optimize the vector representations during training.
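The mini-batch update described above can be sketched generically in NumPy. This is illustrative only; the project itself used TensorFlow's optimizer with an NCE loss, and the toy objective here (mean squared distance to data points) is an assumption chosen for its simple gradient:

```python
import numpy as np

def sgd(grad_fn, theta, data, lr=0.1, batch_size=2, steps=200, seed=0):
    """Each step estimates the gradient from one randomly picked mini-batch."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        batch = data[rng.choice(len(data), batch_size, replace=False)]
        theta = theta - lr * grad_fn(theta, batch)  # step against the gradient
    return theta

# Toy objective: mean squared distance from theta to the data points;
# its gradient on a batch is 2 * (theta - batch.mean()).
data = np.array([1.0, 2.0, 3.0, 4.0])
grad = lambda theta, batch: 2 * (theta - batch.mean())
theta = sgd(grad, theta=0.0, data=data)
print(theta)  # fluctuates near the data mean, 2.5
```

Each step touches only batch_size examples, which is what makes one pass cheap compared with full-batch gradient descent on 5B word pairs.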
Implementation
The parameters:

VOC_SIZE: The vocabulary size.
SKIP_WINDOW: The window size of context words around the target word.
NUM_SKIPS: The number of context words randomly taken to generate word pairs.
EMBEDDING_SIZE: The number of parameters in the word embedding; the size of the word vector.
LR: The learning rate of gradient descent.
BATCH_SIZE: The size of each batch in stochastic gradient descent. Running one batch is one step.
NUM_STEPS: The number of training steps.
NUM_SAMPLE: The number of negative samples.
Implementation
Tools and packages:
• TensorFlow r1.4
• TensorBoard 0.1.6
• Python 2.7.10
• Wikipedia Extractor v2.55
• sklearn.cluster [15]
• numpy
Experiments and Discussions

The experimental results are compared with Schütze's unsupervised learning approach from 1998:
• Schütze used a dataset (435M) taken from the New York Times News Service. We used a dataset extracted from Wikipedia pages (12G).
• Schütze used co-occurrence counts to generate vectors, which had a large number of dimensions (1,000/2,000). We used the Skip-gram model to learn a distributed word representation with a dimension of 250.
• Schütze applied singular value decomposition because of the large number of vector dimensions. Taking advantage of a smaller number of dimensions, we did not need to perform matrix decomposition.
• We experimented with the Skip-gram model with different parameters and selected one word embedding for clustering.
• Skip-gram model parameters
Experiments and Discussions

Experiment with the skip-gram model
• Used "average loss" to estimate the loss over every 100K batches.
• Visualized some words' nearest words.
Experiments and Discussions

Experiment with classifying word senses
• Clustered the contexts of the occurrences of a given ambiguous word into two/three coherent groups.
• Manually assigned labels to the occurrences of ambiguous words in the test corpus, and compared them with machine-learned labels to calculate accuracy.
• Before word sense determination, we assigned all occurrences to the most frequent meaning, and used that fraction as the baseline.
accuracy = (number of instances with the correct machine-learned sense label) / (total number of test instances)
Experiments and Discussions

• The "Schütze's baseline" column gives the fraction of the most frequent sense in his datasets.
• The "Schütze's accuracy" column gives the results of his disambiguation experiments with local term frequency, where applicable.
• We got better accuracy in the experiments with "capital" and "plant".
• However, the model cannot determine the senses of the words "interest" and "sake", which have baselines over 85% in our datasets.
Experiments and Discussions

Discussions
• Our datasets (12G) are much larger than Schütze's datasets (435M). For example, the size of his training set for the word "capital" is 13,015, and ours is 179,793. The larger datasets might have helped to increase the accuracy for some words.
• We also observed that when the baseline is high (>= 85%), the model cannot determine the senses of the word. The performance of unsupervised learning relies on sufficient information from the training data. However, the model did not get trained with sufficient data carrying the less frequent meanings.
• The size of the training data and the distribution of the senses of the target word have a significant influence on the performance of the model.
Conclusion
• In this project, we utilized distributed word representations and the distributional hypothesis to build a modular model to classify the senses of ambiguous words.
• Our experiments showed our model performed well when each sense of an ambiguous word accounted for more than 20% of the occurrences in the training dataset.
Conclusion and Future Work

Future Work
• Optimize the classifier. One possible approach might be using a weighted sum of contexts, taking IDF into account.
• Extend and experiment with this approach on other models with different classifiers. A classifier that works well when occurrences are skewed to one class might improve the accuracy for words where a large portion of occurrences use the most frequent sense.
• Tokenize the corpus; we could reduce the time cost of training by reducing the vocabulary size.
References

• Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
• G. E. Hinton, J. L. McClelland, D. E. Rumelhart. Distributed representations. In: Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations, MIT Press, 1986.
• T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, 2007.
• David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
• H. Schwenk. Continuous space language models. Computer Speech and Language, vol. 21, 2007.
• T. Mikolov, A. Deoras, S. Kombrink, L. Burget, J. Černocký. Empirical evaluation and combination of advanced language modeling techniques. In: Proceedings of Interspeech, 2011.
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013a.
• James R. Curran and Marc Moens. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition, pages 59-66, 2002.
• Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proc. of SIGKDD-02, pages 613-619, New York, NY, USA. ACM, 2002.
• Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of SIGDOC, pages 24-26, 1986.
• Olah, Christopher. Deep Learning, NLP, and Representations. Retrieved from http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/, 2014.
• Hartigan, J. A. and Wong, M. A. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1): pages 100-108, 1979.
• Schütze, Hinrich. Dimensions of meaning. In Proceedings of Supercomputing '92, pages 787-796, 1992.
• Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR 12, pp. 2825-2830, 2011.
• Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307-361, 2012.
• Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y., Saporta G. (eds) Proceedings of COMPSTAT'2010. Physica-Verlag HD.
• TensorFlow Tutorial, tf.nn.nce_loss. Retrieved from https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss, 2017.
• McCormick, C. Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com, 2017, January 11.
• D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. Proc. 33rd Annual Meeting of the ACL, Cambridge, MA, USA, pp. 189-196, 1995.
• Schütze, Hinrich. Automatic word sense discrimination. Computational Linguistics, v. 24 n. 1, March 1998.
Questions
Thank You!
Appendix: Model Architecture

Skip-gram model architecture
• We trained the neural network by feeding it word pairs of target word and context word found in our training data set.