
João Sedoc, Intro HLT class

November 4, 2019

Distributional Semantics

 Nida example:
◦ A bottle of tesgüino is on the table.
◦ Everybody likes tesgüino.
◦ Tesgüino makes you drunk.
◦ We make tesgüino out of corn.

 From the context words, humans can guess that tesgüino means:
◦ an alcoholic beverage like beer

 Intuition for the algorithm:
◦ Two words are similar if they have similar word contexts.

Intuition of distributional word similarity

If we consider optometrist and eye-doctor we find that, as our corpus of utterances grows, these two occur in almost the same environments. In contrast, there are many sentence environments in which optometrist occurs but lawyer does not... It is a question of the relative frequency of such environments, and of what we will obtain if we ask an informant to substitute any word he wishes for optometrist (not asking what words have the same meaning). These and similar tests all measure the probability of particular environments occurring with particular elements... If A and B have almost identical environments we say that they are synonyms.

–Zellig Harris (1954)

“You shall know a word by the company it keeps!” –John Firth (1957)

Distributional Hypothesis

Distributional models of meaning = vector-space models of meaning = vector semantics

Intuitions:

Zellig Harris (1954):
◦ “oculist and eye-doctor … occur in almost the same environments”
◦ “If A and B have almost identical environments we say that they are synonyms.”

Firth (1957):
◦ “You shall know a word by the company it keeps!”


Intuition

 Model the meaning of a word by “embedding” it in a vector space.

 The meaning of a word is a vector of numbers.
◦ Vector models are also called “embeddings”.

 Contrast: in many computational linguistics applications, word meaning is represented by a vocabulary index (“word number 545”).

vec(“dog”) = (0.2, -0.3, 1.5, …)
vec(“bites”) = (0.5, 1.0, -0.4, …)
vec(“man”) = (-0.1, 2.3, -1.5, …)


Term-document matrix


            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                   1               1               8        15
soldier                  2               2              12        36
fool                    37              58               1         5
clown                    6             117               0         0

 Each cell: the count of term t in document d, tf_t,d.
◦ Each document is a count vector in ℕ^|V|: a column in the matrix above.

 Two documents are similar if their vectors are similar.



The words in a term-document matrix

 Each word is a count vector in ℕ^D: a row in the matrix below.


            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                   1               1               8        15
soldier                  2               2              12        36
fool                    37              58               1         5
clown                    6             117               0         0

 Two words are similar if their vectors are similar.

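A minimal sketch of building such a term-document matrix with scikit-learn's CountVectorizer; the four toy "documents" below are invented stand-ins for the plays:

```python
# Build a small term-document matrix; rows are words, columns are documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [                                      # toy stand-ins for the four plays
    "no battle today said the fool to the clown",
    "the clown and the fool sang",
    "the soldier fell in battle",
    "the soldier won the battle for the king",
]

vectorizer = CountVectorizer(vocabulary=["battle", "soldier", "fool", "clown"])
X = vectorizer.fit_transform(docs)            # shape: (n_docs, n_terms)

term_doc = X.T.toarray()                      # transpose: terms as rows
for term, row in zip(vectorizer.get_feature_names_out(), term_doc):
    print(f"{term:>8s}", row)
```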

Brilliant insight: Use running text as implicitly supervised training data!

• A word s near fine acts as the gold ‘correct answer’ to the question:
◦ “Is word w likely to show up near fine?”

• No need for hand-labeled supervision (see the sketch below)
• The idea comes from neural language modeling:
◦ Bengio et al. (2003)
◦ Collobert et al. (2011)
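A minimal sketch of extracting such implicitly supervised (target, context) training pairs from running text; the window size of 2 is an arbitrary illustrative choice:

```python
# Turn running text into (target, context) pairs: every word within
# `window` positions of a target counts as one of its contexts.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("that is one fine day".split()))
```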

Word2vec

 Popular embedding method
 Very fast to train
 Code available on the web
 Idea: predict rather than count
◦ Instead of counting how often each word w occurs near “fine”,
◦ train a classifier on a binary prediction task:
◦ is w likely to show up near “fine”?

◦ We don’t actually care about this task,
◦ but we’ll take the learned classifier weights as the word embeddings (training sketch below).
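A minimal training sketch using the gensim library (parameter names assume gensim 4.x; the three-sentence corpus is a toy stand-in):

```python
# Train skip-gram word2vec (sg=1) on a toy corpus and inspect an embedding.
from gensim.models import Word2Vec

sentences = [
    ["everybody", "likes", "tesgüino"],
    ["tesgüino", "makes", "you", "drunk"],
    ["we", "make", "tesgüino", "out", "of", "corn"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["tesgüino"][:5])           # first 5 dimensions of the embedding
print(model.wv.most_similar("tesgüino"))  # nearest neighbors by cosine
```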


Dense embeddings you can download!

 Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/

 FastText: http://www.fasttext.cc/

 GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/
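One way to load such pretrained vectors is gensim's downloader module; "glove-wiki-gigaword-50" is one of its bundled model names (the file is fetched on first use):

```python
# Load pretrained GloVe vectors and look one up.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")
print(glove["fine"][:5])                    # first 5 dimensions
print(glove.most_similar("fast", topn=3))   # nearest neighbors by cosine
```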

Why vector models of meaning? Computing the similarity between words

“fast” is similar to “rapid”
“tall” is similar to “height”

Question answering:
Q: “How tall is Mt. Everest?”
Candidate A: “The official height of Mount Everest is 29,029 feet”


Analogy: Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)

vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
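The same arithmetic with gensim's KeyedVectors (pretrained GloVe loaded as in the earlier sketch; note GloVe's vocabulary is lowercased):

```python
# Solve the analogies via vector offsets: positive terms are added,
# negative terms subtracted, and the nearest remaining word is returned.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
print(glove.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# "queen" and "rome" are expected to top these lists
```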


Evaluating similarity

 Extrinsic (task-based, end-to-end) evaluation:
◦ Question answering
◦ Spell checking
◦ Essay grading

 Intrinsic evaluation:
◦ Correlation between algorithm and human word-similarity ratings
◦ WordSim-353: 353 noun pairs rated 0-10, e.g. sim(plane, car) = 5.77
◦ Taking TOEFL multiple-choice vocabulary tests:
◦ Levied is closest in meaning to: imposed, believed, requested, correlated

Evaluating embeddings

 Compare to human scores on word-similarity-type tasks (a sketch of this comparison follows the list):
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
• TOEFL dataset: Levied is closest in meaning to: imposed, believed, requested, correlated
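A minimal sketch of the correlation-based evaluation: compare model cosine similarities against human ratings with Spearman's rho. The plane/car rating is from the slide; the other two pairs and ratings are made up for illustration:

```python
# Intrinsic evaluation: rank-correlate model similarity with human ratings.
from scipy.stats import spearmanr
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

pairs = [("plane", "car", 5.77),    # rating from WordSim-353
         ("tiger", "cat", 7.0),     # illustrative rating
         ("stock", "phone", 1.5)]   # illustrative rating

human = [rating for _, _, rating in pairs]
model = [glove.similarity(w1, w2) for w1, w2, _ in pairs]

rho, _ = spearmanr(human, model)
print(f"Spearman's rho: {rho:.2f}")
```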


Measuring similarity

 Given two target words v and w, we need a way to measure their similarity.

 Most measures of vector similarity are based on the cosine between embeddings, defined below:
◦ high when two vectors have large values in the same dimensions;
◦ low (in fact 0) for orthogonal vectors with zeros in complementary distribution.
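For reference, the standard cosine similarity between N-dimensional vectors v and w:

```latex
\cos(\mathbf{v}, \mathbf{w})
  = \frac{\mathbf{v} \cdot \mathbf{w}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{w} \rVert}
  = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^{2}} \, \sqrt{\sum_{i=1}^{N} w_i^{2}}}
```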


Visualizing vectors and angles

[Figure: the three count vectors plotted as arrows in two dimensions, Dimension 1: ‘large’ and Dimension 2: ‘data’; information and digital point in nearly the same direction, apricot in a very different one.]

             large   data
apricot          2      0
digital          0      1
information      1      6
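The angles in the figure, made concrete: a short numpy computation of cosine similarity over the 2-dimensional count vectors from the table:

```python
# Cosine similarities between the 2-D count vectors in the table above.
import numpy as np

vecs = {
    "apricot":     np.array([2.0, 0.0]),   # (large, data)
    "digital":     np.array([0.0, 1.0]),
    "information": np.array([1.0, 6.0]),
}

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(vecs["digital"], vecs["information"]))  # ~0.99: nearly parallel
print(cosine(vecs["apricot"], vecs["information"]))  # ~0.16: small overlap
print(cosine(vecs["apricot"], vecs["digital"]))      # 0.0: orthogonal
```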

Bias in Word Embeddings

More to come on bias later …

Embeddings are the workhorse of NLP

◦ Used as pre-initialization for language models, neural MT, classification, and NER systems, …
◦ Downloadable and easily trainable
◦ Available in hundreds of languages
◦ And … contextualized word embeddings are built on top of them

Contextualized Word Embeddings

BERT - Deep Bidirectional Transformers

Putting it together

• Multiple (N) layers

• For encoder-decoder attention, Q: previous decoder layer, K and V: output of encoder

• For encoder self-attention, Q/K/V all come from previous encoder layer

• For decoder self-attention, allow each position to attend to all positions up to that position

• Positional encoding for word order

From last time …

Transformer
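A minimal numpy sketch of the scaled dot-product attention these Q/K/V descriptions refer to (single head, no masking; adding a causal mask to the scores would give the decoder self-attention variant):

```python
# Scaled dot-product attention: weight each value by how well its key
# matches the query, with a softmax over the key positions.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

Q = np.random.randn(3, 8)   # 3 query positions, model dimension 8
K = np.random.randn(5, 8)   # 5 key/value positions
V = np.random.randn(5, 8)
print(attention(Q, K, V).shape)  # -> (3, 8)
```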

Training BERT

◦ BERT has two training objectives:
1. Predict missing (masked) words
2. Predict whether one sentence is the next sentence after another

BERT - predict missing words
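A minimal sketch of objective 1 using a pretrained BERT via the Hugging Face transformers fill-mask pipeline (the example sentence is my own):

```python
# Ask BERT to fill in the masked token and show its top guesses.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("You shall know a word by the [MASK] it keeps."):
    print(pred["token_str"], round(pred["score"], 3))
```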

BERT - predict: is this the next sentence?

BERT on tasks

Take-Aways

◦ Distributional semantics: learn a word’s “meaning” from its context
◦ The simplest representation is frequency in documents
◦ Word embeddings (Word2Vec, GloVe, FastText) predict rather than use co-occurrence counts
◦ Similarity is often measured using cosine
◦ Intrinsic evaluation uses correlation with human judgements of word similarity
◦ Contextualized embeddings are the new stars of NLP
