
João Sedoc, Intro HLT class

November 4, 2019

Distributional Semantics

 Nida example:
◦ A bottle of tesgüino is on the table.
◦ Everybody likes tesgüino.
◦ Tesgüino makes you drunk.
◦ We make tesgüino out of corn.

 From the context words, humans can guess that tesgüino means:
◦ an alcoholic beverage like beer

 Intuition for the algorithm:
◦ Two words are similar if they have similar word contexts.

Intuition of distributional word similarity

If we consider optometrist and eye-doctor we find that, as our corpus of utterances grows, these two occur in almost the same environments. In contrast, there are many sentence environments in which optometrist occurs but lawyer does not... It is a question of the relative frequency of such environments, and of what we will obtain if we ask an informant to substitute any word he wishes for optometrist (not asking what words have the same meaning). These and similar tests all measure the probability of particular environments occurring with particular elements... If A and B have almost identical environments we say that they are synonyms.

–Zellig Harris (1954)

“You shall know a word by the company it keeps!” –John Firth (1957)

Distributional Hypothesis

Distributional models of meaning = vector-space models of meaning = vector semantics

Intuitions:

Zellig Harris (1954):
◦ “oculist and eye-doctor … occur in almost the same environments”
◦ “If A and B have almost identical environments we say that they are synonyms.”

Firth (1957):
◦ “You shall know a word by the company it keeps!”


Intuition

 Model the meaning of a word by “embedding” it in a vector space.

 The meaning of a word is a vector of numbers.
◦ Vector models are also called “embeddings”.

 Contrast: in many computational linguistics applications, word meaning is represented by a vocabulary index (“word number 545”).

vec(“dog”) = (0.2, -0.3, 1.5, …)
vec(“bites”) = (0.5, 1.0, -0.4, …)
vec(“man”) = (-0.1, 2.3, -1.5, …)


Term-document matrix


            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                   1               1               8        15
soldier                  2               2              12        36
fool                    37              58               1         5
clown                    6             117               0         0

 Each cell: the count of term t in document d, tf_t,d.
◦ Each document is a count vector in ℕ^|V|: a column in the matrix above.

 Two documents are similar if their vectors are similar.



The words in a term-document matrix

 Each word is a count vector in ℕ^D: a row in the matrix below.


            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                   1               1               8        15
soldier                  2               2              12        36
fool                    37              58               1         5
clown                    6             117               0         0

 Two words are similar if their vectors are similar.

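A minimal sketch of building such a term-document matrix with scikit-learn's CountVectorizer; the four toy "documents" below are invented stand-ins for the plays:

```python
# Build a small term-document matrix; rows are words, columns are documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [                                      # toy stand-ins for the four plays
    "no battle today said the fool to the clown",
    "the clown and the fool sang",
    "the soldier fell in battle",
    "the soldier won the battle for the king",
]

vectorizer = CountVectorizer(vocabulary=["battle", "soldier", "fool", "clown"])
X = vectorizer.fit_transform(docs)            # shape: (n_docs, n_terms)

term_doc = X.T.toarray()                      # transpose: terms as rows
for term, row in zip(vectorizer.get_feature_names_out(), term_doc):
    print(f"{term:>8s}", row)
```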

Brilliant insight: Use running text as implicitly supervised training data!

• A word s near fine acts as the gold ‘correct answer’ to the question:
◦ “Is word w likely to show up near fine?”

• No need for hand-labeled supervision (see the sketch below)
• The idea comes from neural language modeling:
◦ Bengio et al. (2003)
◦ Collobert et al. (2011)
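A minimal sketch of extracting such implicitly supervised (target, context) training pairs from running text; the window size of 2 is an arbitrary illustrative choice:

```python
# Turn running text into (target, context) pairs: every word within
# `window` positions of a target counts as one of its contexts.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("that is one fine day".split()))
```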

Word2vec

 Popular embedding method
 Very fast to train
 Code available on the web
 Idea: predict rather than count
◦ Instead of counting how often each word w occurs near “fine”,
◦ train a classifier on a binary prediction task:
◦ is w likely to show up near “fine”?

◦ We don’t actually care about this task,
◦ but we’ll take the learned classifier weights as the word embeddings (training sketch below).
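A minimal training sketch using the gensim library (parameter names assume gensim 4.x; the three-sentence corpus is a toy stand-in):

```python
# Train skip-gram word2vec (sg=1) on a toy corpus and inspect an embedding.
from gensim.models import Word2Vec

sentences = [
    ["everybody", "likes", "tesgüino"],
    ["tesgüino", "makes", "you", "drunk"],
    ["we", "make", "tesgüino", "out", "of", "corn"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["tesgüino"][:5])           # first 5 dimensions of the embedding
print(model.wv.most_similar("tesgüino"))  # nearest neighbors by cosine
```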


Dense embeddings you can download!

 Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/

 FastText: http://www.fasttext.cc/

 GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/
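One way to load such pretrained vectors is gensim's downloader module; "glove-wiki-gigaword-50" is one of its bundled model names (the file is fetched on first use):

```python
# Load pretrained GloVe vectors and look one up.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")
print(glove["fine"][:5])                    # first 5 dimensions
print(glove.most_similar("fast", topn=3))   # nearest neighbors by cosine
```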

Why vector models of meaning? Computing the similarity between words

“fast” is similar to “rapid”
“tall” is similar to “height”

Question answering:
Q: “How tall is Mt. Everest?”
Candidate A: “The official height of Mount Everest is 29,029 feet”


Analogy: Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)

vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
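The same arithmetic with gensim's KeyedVectors (pretrained GloVe loaded as in the earlier sketch; note GloVe's vocabulary is lowercased):

```python
# Solve the analogies via vector offsets: positive terms are added,
# negative terms subtracted, and the nearest remaining word is returned.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
print(glove.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# "queen" and "rome" are expected to top these lists
```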


Evaluating similarity

 Extrinsic (task-based, end-to-end) evaluation:
◦ Question answering
◦ Spell checking
◦ Essay grading

 Intrinsic evaluation:
◦ Correlation between algorithm and human word-similarity ratings
◦ WordSim-353: 353 noun pairs rated 0-10, e.g. sim(plane, car) = 5.77
◦ Taking TOEFL multiple-choice vocabulary tests:
◦ Levied is closest in meaning to: imposed, believed, requested, correlated

Evaluating embeddings

 Compare to human scores on word-similarity-type tasks (a sketch of this comparison follows the list):
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
• TOEFL dataset: Levied is closest in meaning to: imposed, believed, requested, correlated
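A minimal sketch of the correlation-based evaluation: compare model cosine similarities against human ratings with Spearman's rho. The plane/car rating is from the slide; the other two pairs and ratings are made up for illustration:

```python
# Intrinsic evaluation: rank-correlate model similarity with human ratings.
from scipy.stats import spearmanr
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

pairs = [("plane", "car", 5.77),    # rating from WordSim-353
         ("tiger", "cat", 7.0),     # illustrative rating
         ("stock", "phone", 1.5)]   # illustrative rating

human = [rating for _, _, rating in pairs]
model = [glove.similarity(w1, w2) for w1, w2, _ in pairs]

rho, _ = spearmanr(human, model)
print(f"Spearman's rho: {rho:.2f}")
```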


Measuring similarity

 Given two target words v and w, we need a way to measure their similarity.

 Most measures of vector similarity are based on the cosine between embeddings, defined below:
◦ high when two vectors have large values in the same dimensions;
◦ low (in fact 0) for orthogonal vectors with zeros in complementary distribution.
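For reference, the standard cosine similarity between N-dimensional vectors v and w:

```latex
\cos(\mathbf{v}, \mathbf{w})
  = \frac{\mathbf{v} \cdot \mathbf{w}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{w} \rVert}
  = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^{2}} \, \sqrt{\sum_{i=1}^{N} w_i^{2}}}
```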


Visualizing vectors and angles

[Figure: the three count vectors plotted as arrows in two dimensions, Dimension 1: ‘large’ and Dimension 2: ‘data’; information and digital point in nearly the same direction, apricot in a very different one.]

             large   data
apricot          2      0
digital          0      1
information      1      6
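The angles in the figure, made concrete: a short numpy computation of cosine similarity over the 2-dimensional count vectors from the table:

```python
# Cosine similarities between the 2-D count vectors in the table above.
import numpy as np

vecs = {
    "apricot":     np.array([2.0, 0.0]),   # (large, data)
    "digital":     np.array([0.0, 1.0]),
    "information": np.array([1.0, 6.0]),
}

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(vecs["digital"], vecs["information"]))  # ~0.99: nearly parallel
print(cosine(vecs["apricot"], vecs["information"]))  # ~0.16: small overlap
print(cosine(vecs["apricot"], vecs["digital"]))      # 0.0: orthogonal
```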

Bias in Word Embeddings

More to come on bias later …

Embeddings are the workhorse of NLP

◦ Used as pre-initialization for language models, neural MT, classification, and NER systems, …
◦ Downloadable and easily trainable
◦ Available in hundreds of languages
◦ And … contextualized word embeddings are built on top of them

Contextualized Word Embeddings

BERT - Deep Bidirectional Transformers

Putting it together

• Multiple (N) layers

• For encoder-decoder attention, Q: previous decoder layer, K and V: output of encoder

• For encoder self-attention, Q/K/V all come from previous encoder layer

• For decoder self-attention, allow each position to attend to all positions up to that position

• Positional encoding for word order

From last time …

Transformer
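A minimal numpy sketch of the scaled dot-product attention these Q/K/V descriptions refer to (single head, no masking; adding a causal mask to the scores would give the decoder self-attention variant):

```python
# Scaled dot-product attention: weight each value by how well its key
# matches the query, with a softmax over the key positions.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

Q = np.random.randn(3, 8)   # 3 query positions, model dimension 8
K = np.random.randn(5, 8)   # 5 key/value positions
V = np.random.randn(5, 8)
print(attention(Q, K, V).shape)  # -> (3, 8)
```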

Training BERT

◦ BERT has two training objectives:
1. Predict missing (masked) words
2. Predict whether one sentence is the next sentence after another

BERT - predict missing words
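A minimal sketch of objective 1 using a pretrained BERT via the Hugging Face transformers fill-mask pipeline (the example sentence is my own):

```python
# Ask BERT to fill in the masked token and show its top guesses.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("You shall know a word by the [MASK] it keeps."):
    print(pred["token_str"], round(pred["score"], 3))
```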

BERT - predict: is this the next sentence?

BERT on tasks

Take-Aways

◦ Distributional semantics: learn a word’s “meaning” from its context
◦ The simplest representation is frequency in documents
◦ Word embeddings (Word2Vec, GloVe, FastText) predict rather than use co-occurrence counts
◦ Similarity is often measured using cosine
◦ Intrinsic evaluation uses correlation with human judgements of word similarity
◦ Contextualized embeddings are the new stars of NLP
