
CS388: Natural Language Processing

Eunsol Choi

Lecture 6: Word Embeddings

Slides adapted from Greg Durrett, Princeton NLP course

Representation of Words


‣ Traditional approaches build one-hot feature vectors.

truck = [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
austin = [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]

‣ The dimension of the vector is the size of the vocabulary!
‣ Moreover, no generalization: distinct words are orthogonal, so nothing is shared between related words.
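A minimal sketch (not from the slides) illustrating both points; the toy vocabulary and indices are made up:

```python
import numpy as np

# Toy vocabulary; the words and indices are made up for illustration.
vocab = {"the": 0, "truck": 1, "car": 2, "austin": 3}

def one_hot(word, vocab):
    """Return a |V|-dimensional one-hot vector for `word`."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

truck, car = one_hot("truck", vocab), one_hot("car", vocab)
# Distinct words are orthogonal, so "truck" looks exactly as (dis)similar
# to "car" as it does to "the": no similarity structure, no generalization.
print(truck @ car)  # 0.0
```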

Distributional Semantics


“tejuino”
Context 1: A bottle of ___ is on the table.

Context 2: Everybody likes ___.

Context 3: Don’t have ___ before you drive.

Context 4: We make ___ out of corn.

The example is from Lin (1998).

Distributional Semantics

‣ “You shall know a word by the company it keeps” Firth (1957)

slide credit: Dan Klein

Goal of Word Embeddings


‣ Learn a continuous, dense vector for each word type.
‣ Always the same vector, regardless of the context in which the word appears.

employees = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271, 0.487]

Discrete Word Representations

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster)

[Figure: binary merge tree over words such as good, enjoyable, great, fish, cat, dog, is, go; the 0/1 labels on the root-to-leaf path give each word a bit-string id]

Brown et al. (1992)

[Figure: “The cats are under the table.” The words w1, …, w6 map to cluster ids c1, …, c6; both cluster ids and word ids are bit strings from the tree (e.g. 1001110, 10011110, 1110111, 110, 100110)]

‣ Goal is to maximize P(wi | wi−1) = P(ci | ci−1) P(wi | ci)
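A toy numeric illustration of this factorization; the clusters and probabilities below are invented purely for the arithmetic:

```python
# Invented toy numbers, purely to show the factorization
# P(wi | wi-1) = P(ci | ci-1) * P(wi | ci).
cluster_of = {"the": "DET", "cats": "NOUN"}
p_class_bigram = {("DET", "NOUN"): 0.6}        # P(ci | ci-1), assumed
p_word_given_class = {("cats", "NOUN"): 0.01}  # P(wi | ci), assumed

def p_next(prev_word, word):
    c_prev, c = cluster_of[prev_word], cluster_of[word]
    return p_class_bigram[(c_prev, c)] * p_word_given_class[(word, c)]

print(p_next("the", "cats"))  # 0.6 * 0.01 = 0.006
```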

[Figure: 2D embedding space in which good, enjoyable, and great cluster together, away from bad, dog, and is, so that “the movie was great” ≈ “the movie was good”]

Word Embeddings

‣ Goal: come up with a way to produce these embeddings

‣ For each word, a “medium”-dimensional vector (50–300 dims) representing it

word2vec / GloVe (Global Vectors)

word2vec


Continuous Bag of Words (CBOW) / Skip-grams

Continuous Bag-of-Words

‣ Predict word from context: the dog bit the man

‣ Parameters: d × |V| context embeddings (one d-length context vector per word), |V| × d output parameters W

[Figure: CBOW architecture for “the dog bit the man”. Look up the d-dimensional context vectors for “the” and “dog”, sum them (size d), multiply by the |V| × d output matrix W, and take a softmax over the vocabulary. Gold label = bit, so no manual labeling is required!]

P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))

Mikolov et al. (2013)
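A rough numpy sketch of this forward pass (random parameters, made-up word ids); a real implementation would train C and W with SGD over many such windows:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100                         # vocab size and embedding dim (assumed)
C = rng.normal(scale=0.1, size=(V, d))     # context vectors c(w): one per word
W = rng.normal(scale=0.1, size=(V, d))     # output parameters, |V| x d

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_probs(left_id, right_id):
    """P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))."""
    h = C[left_id] + C[right_id]           # sum the context embeddings, size d
    return softmax(W @ h)                  # distribution over the vocabulary

# Window from "the dog bit the man": context (the, dog), gold label "bit".
probs = cbow_probs(left_id=3, right_id=17) # made-up word ids
loss = -np.log(probs[42])                  # 42 = assumed id of "bit"
```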

Skip-Gram

‣ Predict one word of context from the word: the dog bit the man

[Figure: Skip-gram architecture. Embed “bit”, multiply by W, softmax over the vocabulary; gold = dog]

‣ Parameters: d × |V| word vectors, |V| × d output parameters W (also usable as word vectors!)

‣ Another training example: bit -> the

P(w′ | w) = softmax(W e(w))

Mikolov et al. (2013)

Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))

[Figure: Huffman tree over the vocabulary, with words such as “the” and “a” at the leaves]

‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take

Mikolov et al. (2013)

P(w′ | w) = softmax(W e(w))

‣ log(|V|) binary decisions
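A sketch of the idea, assuming the Huffman tree (root-to-leaf paths and per-node vectors) has already been built: the word's probability is a product of roughly log |V| sigmoid decisions instead of one |V|-way softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_prob(h, path_nodes, path_bits, node_vecs):
    """P(word | h) as a product of ~log|V| binary decisions down a Huffman tree.

    h:          hidden vector (e.g. the summed context vectors in CBOW)
    path_nodes: ids of the internal nodes on the root-to-leaf path for this word
    path_bits:  branch taken (0 or 1) at each of those nodes
    node_vecs:  one parameter vector per internal node (assumed, e.g. a dict)
    """
    p = 1.0
    for node, bit in zip(path_nodes, path_bits):
        p_right = sigmoid(node_vecs[node] @ h)
        p *= p_right if bit == 1 else (1.0 - p_right)
    return p
```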

Skip-Gram with Negative Sampling

‣ d × |V| word vectors, d × |V| context vectors (same # of params as before)

Mikolov et al. (2013)

(bit, the) => +1
(bit, cat) => -1
(bit, a) => -1
(bit, fish) => -1

‣ Take (word, context) pairs and classify them as “real” or not. Create random negative examples by sampling from the unigram distribution

words in similar contexts select for similar c vectors

P(y = 1 | w, c) = e^(w·c) / (e^(w·c) + 1)

‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1}^{k} log P(y = 0 | w, ci)   (the ci are the k sampled negative contexts)
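A numpy sketch of this objective for one (word, context) pair, with made-up sizes; negatives are drawn from a unigram distribution supplied by the caller:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 100, 5                    # vocab size, dim, # negatives (assumed)
Wvec = rng.normal(scale=0.1, size=(V, d))   # word vectors
Cvec = rng.normal(scale=0.1, size=(V, d))   # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, unigram_probs):
    """Negative of: log P(y=1 | w, c) + (1/k) sum_i log P(y=0 | w, ci)."""
    negs = rng.choice(V, size=k, p=unigram_probs)           # sampled negative contexts
    pos = np.log(sigmoid(Wvec[w] @ Cvec[c_pos]))            # P(y=1) = sigmoid(w . c)
    neg = np.log(sigmoid(-Wvec[w] @ Cvec[negs].T)).mean()   # P(y=0) = sigmoid(-w . c)
    return -(pos + neg)
```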

Connections with Matrix Factorization

Levy et al. (2014)

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: the |V| × |V| matrix of word pair counts is approximated by the product of a |V| × d matrix of word vectors and a d × |V| matrix of context vectors]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Skip-Gram as Matrix Factorization

Levy et al. (2014)

Mij = PMI(wi, cj) − log k      (M is |V| × |V|)

PMI(wi, cj) = P(wi, cj) / (P(wi) P(cj)) = (count(wi, cj)/D) / ((count(wi)/D) · (count(cj)/D))

‣ If we sample negative examples from the uniform distribution over words

(k = number of negative samples)

‣ …and it’s a weighted factorization problem (weighted by word freq)

The skip-gram objective exactly corresponds to factoring this matrix (shown above).
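A sketch of building that shifted-PMI matrix directly from a dense co-occurrence count matrix (fine for a toy vocabulary; real vocabularies need sparse handling):

```python
import numpy as np

def shifted_pmi_matrix(counts, k):
    """Mij = PMI(wi, cj) - log k from a dense |V| x |V| co-occurrence count matrix.

    counts[i, j] = count(wi, cj); D = total number of (word, context) pairs.
    Zero counts give -inf entries; real implementations use sparse / positive
    PMI variants, so treat this as a toy-vocabulary sketch.
    """
    D = counts.sum()
    p_wc = counts / D
    p_w = counts.sum(axis=1, keepdims=True) / D   # P(wi), column vector
    p_c = counts.sum(axis=0, keepdims=True) / D   # P(cj), row vector
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc) - np.log(p_w) - np.log(p_c)
    return pmi - np.log(k)
```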

GloVe (Global Vectors)

Pennington et al. (2014)

‣ Objective = Σi,j f(count(wi, cj)) (wiᵀcj + ai + bj − log count(wi, cj))²

‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix

‣ Constant in the dataset size (just need counts), quadratic in vocabulary size

[Figure: the |V| × |V| matrix of word pair counts]
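A sketch of the objective as a function of the count matrix and parameters; the shapes of the word vectors, context vectors, and biases are assumed, and x_max and alpha follow the weighting constants reported in the GloVe paper:

```python
import numpy as np

def glove_loss(counts, Wvec, Cvec, a, b, x_max=100.0, alpha=0.75):
    """Sum over nonzero counts of f(Xij) * (wi.cj + ai + bj - log Xij)^2.

    Shapes are assumed: counts is |V| x |V|, Wvec and Cvec are |V| x d,
    a and b are length-|V| bias vectors.
    """
    i_idx, j_idx = np.nonzero(counts)
    x = counts[i_idx, j_idx].astype(float)
    f = np.minimum((x / x_max) ** alpha, 1.0)                 # clipped weighting
    pred = (Wvec[i_idx] * Cvec[j_idx]).sum(axis=1) + a[i_idx] + b[j_idx]
    return np.sum(f * (pred - np.log(x)) ** 2)
```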

fastText: Sub-word Embeddings

‣ Same as SGNS, but break words down into character n-grams with n = 3 to 6

Bojanowski et al. (2017)

where:
  3-grams: <wh, whe, her, ere, re>
  4-grams: <whe, wher, here, ere>
  5-grams: <wher, where, here>
  6-grams: <where, where>

‣ Replace w · c in the skip-gram computation with Σ_{g ∈ ngrams(w)} wg · c

‣ Advantages?
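One advantage is that related, rare, and even unseen words share sub-word units. A small sketch of the n-gram extraction and the subword scoring; the ngram_vecs lookup table is hypothetical, and real fastText additionally hashes n-grams into buckets and includes the whole word as an extra unit:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' (with boundary markers), fastText-style."""
    s = f"<{word}>"
    return [s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

# char_ngrams("where")[:5] -> ['<wh', 'whe', 'her', 'ere', 're>']

def subword_score(word, context_vec, ngram_vecs):
    """Replace w . c with the sum over the word's n-grams of wg . c."""
    return sum(ngram_vecs[g] @ context_vec for g in char_ngrams(word))
```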

Evaluating Word Embeddings

Extrinsic vs. Intrinsic


‣ Plug the vectors into some end-task model, see which does well!

[Figure: each word of “I don’t like this movie” is mapped to its embedding, e.g. I → (0.31, −0.28), don’t → (0.01, −0.91), like → (1.87, 0.03), this → (−3.17, −0.18), movie → (1.23, 1.59); the vectors feed into an ML model, which predicts 👎]

‣ Evaluate on some intermediate subtasks (in the coming slides..)

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Figure: 2D embedding space in which good, enjoyable, and great cluster together; cat, wolf, and tiger form another cluster; is and was are close; bad and dog lie elsewhere]

‣ (1) Similarity: similar words are close to each other

‣ (2) Analogy:

Paris is to France as Tokyo is to ???

good is to best as smart is to ???

Similarity

Levy et al. (2015)

‣ SVD = singular value decomposition on the PMI matrix

‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don’t matter in practice
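A sketch of how an intrinsic similarity evaluation is typically scored: Spearman correlation between embedding cosine similarities and human ratings on a word-pair benchmark (the variable names and dataset are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_eval(emb, scored_pairs):
    """Spearman correlation between embedding similarities and human ratings.

    emb: dict mapping word -> vector; scored_pairs: [(w1, w2, human_score), ...],
    e.g. pairs from a benchmark like WordSim-353.
    """
    model = [cosine(emb[w1], emb[w2]) for w1, w2, _ in scored_pairs]
    human = [s for _, _, s in scored_pairs]
    return spearmanr(model, human).correlation
```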

Analogies

[Figure: the offsets man → woman and king → queen are roughly parallel in embedding space]

(king - man) + woman = queen

‣ Why would this be?

‣ woman - man captures the difference in the contexts that these occur in

king + (woman - man) = queen

‣ Often used to evaluate word embeddings
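A brute-force sketch of the analogy evaluation: find the word closest (by cosine) to b − a + c, excluding the query words; `emb` is an assumed dict from word to numpy vector, e.g. loaded GloVe vectors.

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word closest (by cosine) to b - a + c, excluding a, b, c."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = (v / np.linalg.norm(v)) @ target
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# analogy(emb, "man", "king", "woman") should ideally return "queen":
# king - man + woman is closest to queen.
```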

What can go wrong with word embeddings?

‣ What’s wrong with learning a word’s “meaning” from its usage?

‣ What data are we learning from?

‣ What are we going to learn from this data?

What do we mean by bias?

‣ Identify the she – he axis in word vector space, project words onto this axis

Bolukbasi et al. (2016)

Debiasing

Bolukbasi et al. (2016)

‣ Identify gender subspace with gendered words

[Figure: she and he define a gender direction; words such as homemaker, woman, and man are projected onto it]

‣ Project words onto this subspace

‣ Subtract those projections from the original word

(e.g., homemaker becomes homemaker′ after removing its projection)
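A minimal sketch of the projection step, assuming a single gender direction (Bolukbasi et al. actually derive a subspace via PCA over several definitional pairs):

```python
import numpy as np

def debias(word_vec, gender_dir):
    """Remove the component of word_vec along the (normalized) gender direction."""
    g = gender_dir / np.linalg.norm(gender_dir)
    return word_vec - (word_vec @ g) * g

# gender_dir could be emb["she"] - emb["he"], with emb an assumed word -> vector dict:
# homemaker_debiased = debias(emb["homemaker"], emb["she"] - emb["he"])
```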

Hardness of Debiasing

Gonen and Goldberg (2019)

‣ Not that effective… and the male and female words are still clustered together

‣ Bias pervades the word embedding space and isn’t just a local property of a few words

Trained Word Embeddings


‣ word2vec: https://code.google.com/archive/p/word2vec/

‣ GloVe: https://nlp.stanford.edu/projects/glove/

‣ FastText: https://fasttext.cc/

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data

‣ Approach 2: initialize using pretrained word vectors (e.g., GloVe), keep fixed

‣ Approach 3: initialize using pretrained word vectors (e.g., GloVe), fine-tune

‣ Faster because no need to update these parameters

‣ Works best for some tasks

‣ Often works pretty well
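A hedged PyTorch sketch of the three approaches; the random tensor is a placeholder for a real |V| × d GloVe matrix aligned with your vocabulary (loading it from the downloaded text file is omitted):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10_000, 300)   # placeholder for real pretrained vectors

# Approach 2: initialize from pretrained vectors and keep them fixed.
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Approach 3: initialize from pretrained vectors and fine-tune with the task.
tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

# Approach 1: learn embeddings from scratch as ordinary parameters.
scratch_emb = nn.Embedding(10_000, 300)

token_ids = torch.tensor([3, 17, 42])   # made-up ids for a 3-word input
vectors = frozen_emb(token_ids)         # shape (3, 300)
```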

Preview: Context-dependent Embeddings

Peters et al. (2018)

‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors

‣ Context-sensitive word embeddings: depend on the rest of the sentence

‣ Huge improvements across nearly all NLP tasks over static word embeddings

they hit the balls
they dance at balls

‣ How to handle different word senses? One vector for balls

Compositional Semantics

‣ What if we want embedding representations for whole sentences?

‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to the sentence level

‣ Is there a way we can compose vectors to make sentence representa?ons?

Summary

‣ Next time: Language Models!

‣ Lots of pretrained embeddings work well in practice; they capture some desirable properties

‣ Even better: context-sensitive word embeddings (ELMo)