TRANSCRIPT
CS388: Natural Language Processing
Eunsol Choi
Lecture 6: Word Embeddings
Slides adapted from Greg Durrett, Princeton NLP course
Representation of Words
‣ Traditional approaches build one-hot feature vectors.
truck = [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
austin = [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
‣ The dimension of the vector is the size of the vocabulary!
‣ Moreover, no generalization…
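A minimal sketch of what a one-hot representation looks like in code; the tiny vocabulary here is made up for illustration:

```python
# One-hot word vectors: dimension = |V|, no notion of similarity between words.
import numpy as np

vocab = ["the", "a", "dog", "austin", "truck", "bit", "man"]   # illustrative vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("truck"))
```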
Distributional Semantics
“tejuino”
Context 1: A bottle of ___ is on the table.
Context 2: Everybody likes ___.
Context 3: Don’t have ___ before you drive.
Context 4: We make ___ out of corn.
The example is from Lin (1998).
Distributional Semantics
‣ “You shall know a word by the company it keeps” Firth (1957)
slide credit: Dan Klein
Goal of Word Embeddings
‣ Learn a continuous, dense vector for each word type.
‣ Always the same vector, regardless of the context in which the word appears.
employees = [0.286, 0.792, −0.177, −0.107, 10.109, −0.542, 0.349, 0.271, 0.487]
Discrete Word Representations
‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster)
[Figure: a binary tree over the vocabulary; words like good/enjoyable/great, fish/cat/dog, and is/go share subtrees, and each word's cluster id is its bit path from the root, e.g., 10011110, 1110111, 110, 100110]
‣ Example: "The cats are under the table." Word ids w1, …, w6 map to cluster ids c1, …, c6.
Brown et al. (1992)
‣ Goal is to maximize P(w_i | w_{i−1}) = P(c_i | c_{i−1}) · P(w_i | c_i)
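A hedged sketch of how the class-based bigram probability above could be computed; the cluster assignments and count tables are hypothetical placeholders (Brown et al. learn the clustering itself):

```python
# Class-based bigram model: P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i).
cluster_of = {"the": "110", "cats": "1001", "are": "10", "under": "0111", "table": "1001"}  # hypothetical

def class_bigram_prob(prev_word, word, cluster_bigram_counts, cluster_counts, word_counts):
    """Probability of `word` given `prev_word` under the hard clustering."""
    c_prev, c = cluster_of[prev_word], cluster_of[word]
    p_class = cluster_bigram_counts[(c_prev, c)] / cluster_counts[c_prev]  # P(c_i | c_{i-1})
    p_word = word_counts[word] / cluster_counts[c]                         # P(w_i | c_i)
    return p_class * p_word
```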
[Figure: an embedding space where good, enjoyable, and great are near each other and far from bad, dog, and is; with such representations, "the movie was great" ≈ "the movie was good"]
Word Embeddings
‣ Goal: come up with a way to produce these embeddings
‣ For each word, a “medium” dimensional vector (50-300 dims) representing it
Continuous Bag-of-Words
‣ Predict word from context: the dog bit the man
‣ Parameters: d x |V| (one d-length context vector per word), |V| x d output parameters (W)
[Figure: the d-dimensional context vectors c(dog) and c(the) are summed (size d), multiplied by the |V| x d matrix W, and passed through a softmax over the vocabulary; gold label = bit, no manual labeling required!]
P(w | w_{−1}, w_{+1}) = softmax(W(c(w_{−1}) + c(w_{+1})))
Mikolov et al. (2013)
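A minimal numpy sketch of the CBOW prediction above; the vocabulary, dimension, and random parameters are illustrative, not trained values:

```python
# CBOW: predict the center word from the sum of its neighbors' context vectors.
import numpy as np

vocab = ["the", "dog", "bit", "man"]
V, d = len(vocab), 8
rng = np.random.default_rng(0)
C = rng.normal(size=(V, d))   # one d-dimensional context vector per word
W = rng.normal(size=(V, d))   # |V| x d output parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def cbow_probs(left, right):
    """Distribution over the center word given its two neighbors."""
    h = C[vocab.index(left)] + C[vocab.index(right)]   # sum of context vectors, size d
    return softmax(W @ h)                              # multiply by W, then softmax

probs = cbow_probs("dog", "the")   # gold label is "bit" in "the dog bit the man"
```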
Skip-Gram
‣ Predict one word of context from word: the dog bit the man
[Figure: the word vector for bit is multiplied by W and passed through a softmax; gold = dog]
‣ Parameters: d x |V| word vectors, |V| x d output parameters (W) (also usable as vectors!)
‣ Another training example: bit -> the
P(w′ | w) = softmax(W e(w))
Mikolov et al. (2013)
Hierarchical Softmax
‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG
P(w | w_{−1}, w_{+1}) = softmax(W(c(w_{−1}) + c(w_{+1})))
[Figure: a binary (Huffman) tree over the vocabulary, with words such as the and a at the leaves]
‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take
Mikolov et al. (2013)
P(w′ | w) = softmax(W e(w))
‣ log(|V|) binary decisions
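A hedged sketch of the idea: the probability of a word becomes a product of log(|V|) binary (sigmoid) decisions along its code path; the code string and node vectors here are hypothetical placeholders:

```python
# Hierarchical softmax: product of binary decisions along the word's Huffman code path.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(hidden, code, node_vectors):
    """hidden: d-dim input vector; code: the word's bit string (e.g., "1001110");
    node_vectors: one d-dim vector per internal tree node on the path."""
    p = 1.0
    for bit, v in zip(code, node_vectors):
        s = sigmoid(hidden @ v)               # binary decision at this internal node
        p *= s if bit == "1" else (1.0 - s)
    return p
```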
Skip-Gram with Negative Sampling
‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
Mikolov et al. (2013)
(bit, the) => +1
(bit, cat) => -1
(bit, a) => -1
(bit, fish) => -1
‣ Take (word, context) pairs and classify them as “real” or not. Create random negative examples by sampling from the unigram distribution
words in similar contexts select for similar c vectors
P(y = 1 | w, c) = e^{w·c} / (e^{w·c} + 1)
‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1..k} log P(y = 0 | w, c_i), with the c_i sampled as negatives
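A minimal sketch of the objective above for one (word, context) pair and k sampled negatives; the vectors are illustrative placeholders, and the negatives would be drawn from the unigram distribution in practice:

```python
# Skip-gram with negative sampling: push w . c_pos up, push w . c_neg down.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w_vec, c_pos, c_negs):
    """log P(y=1 | w, c) + (1/k) * sum_i log P(y=0 | w, c_i)."""
    pos = np.log(sigmoid(w_vec @ c_pos))
    neg = np.mean([np.log(1.0 - sigmoid(w_vec @ c_neg)) for c_neg in c_negs])
    return pos + neg
```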
Connections with Matrix Factorization
Levy et al. (2014)
‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors
[Figure: a |V| x |V| matrix of word pair counts, approximated by a |V| x d matrix of word vectors times a d x |V| matrix of context vectors]
‣ Looks almost like a matrix factorization… can we interpret it this way?
Skip-Gram as Matrix Factorization
Levy et al. (2014)
‣ If we sample negative examples from the uniform distribution over words, the skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:
M_ij = PMI(w_i, c_j) − log k     (k = number of negative samples)
PMI(w_i, c_j) = P(w_i, c_j) / (P(w_i) P(c_j)) = (count(w_i, c_j)/D) / ((count(w_i)/D) · (count(c_j)/D))
‣ …and it’s a weighted factorization problem (weighted by word frequency)
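A hedged sketch of building this shifted PMI matrix from a |V| x |V| co-occurrence count matrix; `counts` is an illustrative placeholder, and zero counts give −inf entries, which are typically clipped (positive PMI) in practice:

```python
# Shifted PMI matrix: M_ij = PMI(w_i, c_j) - log k.
import numpy as np

def shifted_pmi(counts, k):
    D = counts.sum()                               # total number of (word, context) pairs
    p_wc = counts / D
    p_w = counts.sum(axis=1, keepdims=True) / D
    p_c = counts.sum(axis=0, keepdims=True) / D
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))           # PMI(w_i, c_j)
    return pmi - np.log(k)                         # shift by the number of negative samples
```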
GloVe (Global Vectors)
Pennington et al. (2014)
‣ Objective = Σ_{i,j} f(count(w_i, c_j)) (w_i^T c_j + a_i + b_j − log count(w_i, c_j))^2
‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix
‣ Constant in the dataset size (just need counts), quadratic in vocabulary size
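A minimal sketch of the GloVe objective above, using the paper's weighting f(x) = min(1, (x/x_max)^0.75); W, C, a, b, and the counts matrix are illustrative placeholders:

```python
# GloVe loss: weighted least squares on the log co-occurrence matrix.
import numpy as np

def glove_loss(counts, W, C, a, b, x_max=100.0, alpha=0.75):
    """counts: |V| x |V| co-occurrences; W, C: |V| x d word/context vectors; a, b: |V| biases."""
    f = np.minimum(1.0, (counts / x_max) ** alpha)       # f(0) = 0, so zero counts drop out
    log_counts = np.log(np.maximum(counts, 1e-12))       # guard log(0); those terms get weight 0
    residual = W @ C.T + a[:, None] + b[None, :] - log_counts
    return np.sum(f * residual ** 2)
```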
fastText: Sub-word Embeddings
‣ Same as SGNS, but break words down into n-grams with n = 3 to 6
Bojanowski et al. (2017)
where:
3-grams: <wh, whe, her, ere, re>
4-grams: <whe, wher, here, ere>
5-grams: <wher, where, here>
6-grams: <where, where>
‣ Replace w · c in the skip-gram computation with Σ_{g ∈ ngrams(w)} w_g · c
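A hedged sketch of the subword idea: extract character n-grams of the boundary-marked word and sum their vectors; the n-gram embedding table here is a placeholder (real fastText hashes n-grams into a fixed number of buckets):

```python
# fastText-style subword vectors: a word's vector is the sum of its character n-gram vectors.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    marked = "<" + word + ">"                       # add boundary markers
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def subword_vector(word, ngram_vectors, d=100):
    """Sum the vectors of all n-grams present in the (placeholder) embedding table."""
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(d)

print(char_ngrams("where"))   # reproduces the <wh, whe, her, ... , where> example above
```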
‣ Advantages?
Extrinsic vs. Intrinsic
‣ Extrinsic: plug the vectors into some end-task model, see which does well!
[Figure: each word of “I don’t like this movie” is mapped to a small dense vector and fed into an ML model, which predicts 👎]
‣ Intrinsic: evaluate on some intermediate subtasks (in the coming slides…)
Evaluating Word Embeddings
‣ What properties of language should word embeddings capture?
[Figure: an embedding space where good, enjoyable, and great cluster together away from bad; dog, cat, wolf, and tiger form another cluster; is and was form a third]
‣ (1) Similarity: similar words are close to each other
‣ (2) Analogy:
Paris is to France as Tokyo is to ???
good is to best as smart is to ???
Similarity
Levy et al. (2015)
‣ SVD = singular value decomposition on the PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don’t matter in practice
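A hedged sketch of a standard intrinsic similarity evaluation: correlate cosine similarities of word vectors with human judgments (e.g., a WordSim-style pair list); the embeddings and pair list are placeholders:

```python
# Intrinsic similarity evaluation: Spearman correlation between model and human scores.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_eval(embeddings, pairs_with_scores):
    """pairs_with_scores: list of (word1, word2, human_score) tuples."""
    model_scores = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs_with_scores]
    human_scores = [s for _, _, s in pairs_with_scores]
    return spearmanr(model_scores, human_scores).correlation
```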
Analogies
[Figure: vectors for king, queen, man, woman; the offset woman − man parallels queen − king]
(king − man) + woman = queen
‣ Why would this be?
‣ woman - man captures the difference in the contexts that these occur in
king + (woman - man) = queen
‣ Often used to evaluate word embeddings
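A minimal sketch of solving analogies by vector arithmetic, returning the word closest (by cosine) to king − man + woman while excluding the query words; `embeddings` is an illustrative dict from word to vector:

```python
# Analogy by vector offset: a is to b as c is to ?
import numpy as np

def analogy(a, b, c, embeddings):
    """e.g., analogy("man", "king", "woman", embeddings) should ideally return "queen"."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in embeddings.items():
        if w in (a, b, c):                       # exclude the query words themselves
            continue
        sim = (v / np.linalg.norm(v)) @ target   # cosine similarity to the target point
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```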
What can go wrong with word embeddings?
‣ What’s wrong with learning a word’s “meaning” from its usage?
‣ What data are we learning from?
‣ What are we going to learn from this data?
What do we mean by bias?
‣ Identify the she − he axis in word vector space, project words onto this axis
Bolukbasi et al. (2016)
Debiasing
Bolukbasi et al. (2016)
‣ Identify gender subspace with gendered words
[Figure: she, he, woman, man, and homemaker plotted relative to the she − he direction]
‣ Project words onto this subspace
‣ Subtract those projections from the original word vector, e.g., homemaker → homemaker’
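A hedged sketch of the projection step: estimate a gender direction from she − he, project a word vector onto it, and subtract that component. Real hard debiasing (Bolukbasi et al., 2016) uses a PCA-derived subspace over several gendered pairs:

```python
# Remove the component of a word vector that lies along the gender direction.
import numpy as np

def debias(word_vec, she_vec, he_vec):
    g = she_vec - he_vec
    g = g / np.linalg.norm(g)          # unit gender direction
    projection = (word_vec @ g) * g    # component of the word along that direction
    return word_vec - projection       # e.g., homemaker -> homemaker'
```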
Hardness of Debiasing
Gonen and Goldberg (2019)
‣ Not that effective… and the male and female words are still clustered together
‣ Bias pervades the word embedding space and isn’t just a local property of a few words
Trained Word Embeddings
‣ word2vec: https://code.google.com/archive/p/word2vec/
‣ GloVe: https://nlp.stanford.edu/projects/glove/
‣ fastText: https://fasttext.cc/
Using Word Embeddings
‣ Approach 1: learn embeddings as parameters from your data
‣ Approach 2: initialize using pretrained word vectors (e.g., GloVe), keep fixed
‣ Approach 3: initialize using pretrained word vectors (e.g., GloVe), fine-tune
‣ Faster because no need to update these parameters
‣ Works best for some tasks
‣ Often works pretty well
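A hedged PyTorch sketch of the three approaches: learn embeddings from scratch, or initialize from a pretrained matrix (e.g., loaded from the GloVe files above) and either freeze or fine-tune it; the random tensor here stands in for real pretrained vectors:

```python
# Initializing an embedding layer: from scratch, frozen pretrained, or fine-tuned pretrained.
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)   # placeholder for a loaded |V| x d matrix of GloVe vectors

scratch_emb = nn.Embedding(10000, 300)                                # Approach 1: learn from your data
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)    # Approach 2: keep fixed
tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)    # Approach 3: fine-tune
```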
Preview: Context-dependent Embeddings
Peters et al. (2018)
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on the rest of the sentence
‣ Huge improvements across nearly all NLP tasks over static word embeddings
‣ How to handle different word senses? One vector for balls:
they hit the balls
they dance at balls
Compositional Semantics
‣ What if we want embedding representations for whole sentences?
‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level
‣ Is there a way we can compose vectors to make sentence representa?ons?