
Natural Language Processing

Word vectors

Many slides borrowed from Richard Socher, Chris Manning, and Hugo Larochelle

Lecture plan

• Word representations

• Word vectors (embeddings)

• skip-gram algorithm

• Relation to matrix factorization

• Evaluation

2

Representing words

3

Representing words

Definition: meaning (Webster dictionary)

• the idea that is represented by a word, phrase, etc.

• the idea that a person wants to express by using words, signs, etc.

• the idea that is expressed in a work of writing, art, etc.

In linguistics:

signifier <—> signified (idea or thing) = denotation

4

Taxonomies

[Figure: a taxonomy graph with “beverage” at the root]

Representing words with computers

A word is the set of meanings it has in a taxonomy (graph of meanings).

Hypernym: “is-a” relation
Hyponym: the opposite of “hypernym”

7

Drawbacks

• Expensive!

• Subjective (how to split different synsets?)

• Incomplete:

  • wicked, badass, nifty, crack, ace, wizard, genius, ninja

• Missing functionality:

  • How do you compute word similarity?

  • How do you compose meanings?

8

Discrete representation

Words are atomic symbols (one-hot representation):

V = {hotel, motel, walk, wife, spouse},   |V| ≈ 100,000

hotel  = [1 0 0 0 0]
motel  = [0 1 0 0 0]
walk   = [0 0 1 0 0]
wife   = [0 0 0 1 0]
spouse = [0 0 0 0 1]

9

Drawback

Barack Obama’s wife ≈ Barack Obama’s spouse
Barack Obama’s wife ≉ Barack Obama’s advisors

Seattle motels ≈ Seattle hotels
Seattle motels ≉ Seattle attractions

But all one-hot word vectors are orthogonal and equidistant.

Goal: word vectors with a natural notion of similarity:

$v_{\text{hotel}} \cdot v_{\text{motel}} > v_{\text{hotel}} \cdot v_{\text{spouse}}$

10
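As a quick illustration of why one-hot vectors carry no notion of similarity, here is a minimal NumPy sketch (not from the slides); the tiny vocabulary mirrors the example above.

```python
import numpy as np

vocab = ["hotel", "motel", "walk", "wife", "spouse"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words has dot product 0: "hotel" is no closer to
# "motel" than to "spouse" under this representation.
print(one_hot["hotel"] @ one_hot["motel"])   # 0.0
print(one_hot["hotel"] @ one_hot["spouse"])  # 0.0
```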

Distributional similarity

“You shall know a word by the company it keeps” (Firth, 1957)

“… cashed a check at the bank across the street…”
“… that bank holds the mortgage on my home…”
“… said that the bank raised his forecast for…”
“… employees of the bank have confessed to the charges”

Central idea: represent words by their context

11

Idea 1: represent a word by its context counts

wife   → {met: 3, married: 4, children: 2, wedded: 1, …}
spouse → {met: 2, married: 5, children: 2, kids: 1, …}

Problem: context words that mean the same thing are still treated as unrelated symbols:

• married <==> wedded
• children <==> kids

12

Distributed representations

language = [0.278, −0.911, 0.792, −0.177, 0.109, −0.542, −0.0003]

• Represent words as low-dimensional vectors

• Represent similarity with vector similarity metrics

13

Word vectors

14

Motivation

• Word embeddings are widely used

• Other options exist: word parts, character-level models, …

• The great innovation of 2018: contextualized word embeddings

Supervised learning

• Input: a training set $\{(x_i, y_i)\}_{i=1}^{N}, \; (x_i, y_i) \sim \mathcal{D}(\mathcal{X} \times \mathcal{Y})$

• Output (probabilistic model): $f : \mathcal{X} \to \mathcal{Y}, \quad f(x) = \arg\max_y p(y \mid x)$

• Example: train a spam detector from spam and non-spam e-mails.

Intro to ML prerequisite 16

Word embeddings

“… that bank holds the mortgage on my home…”

1. Define a supervised learning task from raw text (no manual annotation!):

  1. (x, y) = (bank, that)
  2. (x, y) = (bank, holds)
  3. (x, y) = (holds, bank)
  4. (x, y) = (holds, the)
  5. (x, y) = (the, holds)
  6. (x, y) = (the, mortgage)
  7. (x, y) = (mortgage, the)
  8. (x, y) = (mortgage, on)
  9. (x, y) = (on, mortgage)
  10. (x, y) = (on, my)
  11. (x, y) = (my, on)
  12. (x, y) = (my, home)
  …

17
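The pair-generation step above can be sketched in a few lines of Python; `skipgram_pairs` and the window size of 1 are illustrative choices, not part of the original slides.

```python
def skipgram_pairs(tokens, window=1):
    """Generate (center, outside) training pairs from a list of tokens."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("that bank holds the mortgage on my home".split()))
```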

Word embeddings

2. Define a model for the output given the input: p(“holds” | “bank”)

$$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$$

• u: vector for the “outside” word, v: vector for the “center” word, V: number of words in the vocabulary, θ: all parameters

• Multi-class classification model (how many classes?)

• How many parameters are in the model?

$$|\theta| = 2 \cdot V \cdot d, \qquad u, v \in \mathbb{R}^d$$

Intro to ML prerequisite 18
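A minimal NumPy sketch of the softmax model $p_\theta(o \mid c)$ defined above, assuming the vectors are stored as rows of two matrices; the function name and the max-subtraction for numerical stability are my additions.

```python
import numpy as np

def p_outside_given_center(U, Vc, c_idx, o_idx):
    """Softmax model p(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c).

    U:  (|V|, d) matrix of "outside" vectors u_w.
    Vc: (|V|, d) matrix of "center" vectors v_w.
    """
    scores = U @ Vc[c_idx]                 # u_w^T v_c for every word w
    scores -= scores.max()                 # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o_idx]

rng = np.random.default_rng(0)
U, Vc = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
print(p_outside_given_center(U, Vc, c_idx=2, o_idx=7))
```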

Word embeddings

3. Define an objective function for a corpus of length T:

$$L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \ne 0}} p_\theta(w_{t+j} \mid w_t)$$

$$J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p_\theta(w_{t+j} \mid w_t)$$

Find parameters that maximize the objective.

Intro to ML prerequisite 19

Class 1 Recap

Intro to ML prerequisite

• Word representations:

• Ontology-based

• Pros: polysemy, similarity metrics

• Cons: expensive, compositionality, granularity

• One-hot

• Pros: cheap, simple, scales, compositionality

• Cons: no similarity

• Embeddings:

• Cheap, simple, scales, compositionality, similarity

Today

Intro to ML prerequisite

• Word2vec

• Efficiency:

• Hierarchical softmax

• Skipgram with negative sampling (assignment 1)

• Skipgram as matrix factorization

• Evaluation (GloVe)

Word embeddings

“… that bank holds the mortgage on my home…”

1. Define a supervised learning task from raw text (no manual annotation!):

  1. (x, y) = (bank, that)
  2. (x, y) = (bank, holds)
  3. (x, y) = (holds, bank)
  4. (x, y) = (holds, the)
  5. (x, y) = (the, holds)
  6. (x, y) = (the, mortgage)
  7. (x, y) = (mortgage, the)
  8. (x, y) = (mortgage, on)
  9. (x, y) = (on, mortgage)
  10. (x, y) = (on, my)
  11. (x, y) = (my, on)
  12. (x, y) = (my, home)
  …

22

Mikolov et al., 2013

Word embeddings

2. Define a model for the output given the input: p(“holds” | “bank”)

$$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$$

• u: vector for the “outside” word, v: vector for the “center” word, V: number of words in the vocabulary, θ: all parameters

• We don’t really need the distribution - only the representation!

Intro to ML prerequisite 23

Word embeddings

• What probabilities would maximize the objective?

$$J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p_\theta(w_{t+j} \mid w_t)$$

Intro to ML prerequisite

$$L(\Theta) = \prod_{c,o} p(o \mid c)^{\#(c,o)}$$

We can solve separately for each center word c:

$$L_c(\Theta) = \prod_{o} p(o \mid c)^{\#(c,o)}$$

Solve for:

$$J_c(\Theta) = \sum_{i} \#(c, o_i) \log p(o_i \mid c) \quad \text{s.t.} \quad \sum_{i} p(o_i \mid c) = 1, \; p(o_i \mid c) \ge 0$$

Use Lagrange multipliers:

$$\mathcal{L}(\Theta, \lambda) = \sum_{i} \#(c, o_i) \log p(o_i \mid c) - \lambda \Big( \big( \sum_{i} p(o_i \mid c) \big) - 1 \Big)$$

$$\nabla_{p(o_i \mid c)} \mathcal{L} = \frac{\#(c, o_i)}{p(o_i \mid c)} - \lambda = 0 \;\Rightarrow\; p(o_i \mid c) = \frac{\#(c, o_i)}{\lambda}$$

$$\sum_{i} p(o_i \mid c) = \sum_{i} \frac{\#(c, o_i)}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_{i} \#(c, o_i)$$

$$p(o_i \mid c) = \frac{\#(c, o_i)}{\sum_{i} \#(c, o_i)}$$

So the objective is maximized by the empirical conditional co-occurrence probabilities.

Questions

• Intuitions:

• Why should similar words have similar vectors?

• Why do we have different parameters for the center word and the output word?

26

27

Gradient descent

3. How to find parameters that minimize the objective?

• Start at some point and move in the opposite direction of the gradient

Intro to ML prerequisite 28

Gradient descent

$f(x) = x^4 + 3x^3 + 2$

$f'(x) = 4x^3 + 9x^2$

Intro to ML prerequisite 29

Gradient descent

• We want to minimize:

$$J(\theta) = -\sum_{t=1}^{T} \sum_{j} \log p_\theta(w_{t+j} \mid w_t)$$

• Update rule:

$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial J(\theta)}{\partial \theta_j}, \qquad \theta^{new} = \theta^{old} - \alpha \nabla J(\theta)$$

• α is a step size, $\theta \in \mathbb{R}^{2Vd}$

Intro to ML prerequisite 31

Stochastic gradient descent

• For large corpora (billions of tokens) this update is very slow

• Sample a window t

• Update the parameters based on the gradient of that window only:

$$\theta^{new} = \theta^{old} - \alpha \nabla J_t(\theta)$$

Intro to ML prerequisite 32
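A toy sketch of the update rule $\theta^{new} = \theta^{old} - \alpha \nabla J_t(\theta)$; the quadratic objective used here is only for illustration, not part of the slides.

```python
import numpy as np

def sgd_step(theta, grad, alpha=0.05):
    """One (stochastic) gradient descent step: theta <- theta - alpha * grad."""
    return theta - alpha * grad

# Toy example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, 2 * theta)
print(theta)  # close to [0, 0]
```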

Deriving the gradient

• Mostly applications of the chain rule

• Let’s derive the gradient of a center word for a single output word: $\log p_\theta(w_{t+j} \mid w_t)$

• You will do this again in the assignment (and more)

33

Gradient derivation

$$L(\Theta) = \log p(o \mid c) = \log \frac{\exp(u_o^\top v_c)}{\sum_i \exp(u_{o_i}^\top v_c)} = u_o^\top v_c - \log \sum_i \exp(u_{o_i}^\top v_c)$$

$$\nabla_{v_c} L(\Theta) = u_o - \frac{1}{\sum_j \exp(u_{o_j}^\top v_c)} \cdot \sum_i \exp(u_{o_i}^\top v_c) \cdot u_{o_i}$$

$$= u_o - \sum_i \frac{\exp(u_{o_i}^\top v_c)}{\sum_j \exp(u_{o_j}^\top v_c)} \cdot u_{o_i} = u_o - \sum_i p(o_i \mid c) \cdot u_{o_i} = u_o - \mathbb{E}_{o_i \sim p(o_i \mid c)}[u_{o_i}]$$
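The final expression, $u_o - \mathbb{E}_{o_i \sim p(o_i \mid c)}[u_{o_i}]$, can be computed directly with NumPy; the random matrices below are placeholders for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
Vsize, d = 8, 4
U = rng.normal(size=(Vsize, d))        # "outside" vectors u_w, one row per word
v_c = rng.normal(size=d)               # center-word vector v_c
o = 3                                  # index of the observed outside word

# p(w | c) for every w, then the gradient u_o - E_{w ~ p(w|c)}[u_w]
scores = U @ v_c
probs = np.exp(scores - scores.max())
probs /= probs.sum()
grad_vc = U[o] - probs @ U
print(grad_vc)
```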

Recap

• Goal: represent words with low-dimensional vectors

• Approach: Define a supervised learning problem from a corpus

• We defined the necessary components for skip-gram:

• Model (softmax over word labels for each word)

• Objective (minimize Negative Log Likelihood)

• Optimize with SGD

• We computed the gradient for some parameters by hand

35

Computational problem

• Computing the partition function is too expensive

• Solution 1: hierarchical softmax (Morin and Bengio, 2005) reduces computation time to log|V| by constructing a binary tree over the vocabulary

• Solution 2: Change the objective

• skip-gram with negative sampling (home assignment 1)

36

Hierarchical softmax

• p(“cat” | “dog”) = p(left at 1) × p(right at 2) × p(right at 5)
  = (1 − p(right at 1)) × p(right at 2) × p(right at 5)

[Figure: a binary tree over the vocabulary with internal nodes numbered 1 to 7 and the leaves “he”, “she”, “and”, “cat”, “the”, “have”, “be”, “are”]

$$p(\text{cat} \mid \text{dog}) = (1 - \sigma(o_1^\top c_{dog})) \times \sigma(o_2^\top c_{dog}) \times \sigma(o_5^\top c_{dog})$$
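A sketch of how such a path probability could be computed, assuming each internal node has its own vector and p(right at node) = σ(o_node · c), as in the example above; the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(c_vec, node_vecs, path):
    """Probability of reaching a leaf: product of one sigmoid per internal node.

    path: list of (node_index, go_right) decisions from the root to the leaf.
    """
    p = 1.0
    for node, go_right in path:
        p_right = sigmoid(node_vecs[node] @ c_vec)
        p *= p_right if go_right else (1.0 - p_right)
    return p

# Toy example mirroring the slide: left at node 1, right at 2, right at 5.
d = 5
rng = np.random.default_rng(0)
node_vecs = {i: rng.normal(size=d) for i in (1, 2, 5)}
c_dog = rng.normal(size=d)
print(path_probability(c_dog, node_vecs, [(1, False), (2, True), (5, True)]))
```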

Hierarchical softmax

• How to construct the tree?

• Randomly (doesn’t work well but better than you’d think)

• Using external knowledge like WordNet

• Learn word representations somehow and then cluster

Skip-gram with Negative Sampling

(x, y) = ((bank, holds), 1)
(x, y) = ((bank, table), 0)
(x, y) = ((bank, eat), 0)
(x, y) = ((holds, bank), 1)
(x, y) = ((holds, quickly), 0)
(x, y) = ((holds, which), 0)
(x, y) = ((the, mortgage), 1)
(x, y) = ((the, eat), 0)
(x, y) = ((the, who), 0)

39

What information is lost?

$$\sum_{o \in V} p(y = 1 \mid o, c) \;=\; ?$$

Skip-gram with Negative Sampling

• Model:

$$p_\theta(y = 1 \mid c, o) = \frac{1}{1 + \exp(-u_o^\top v_c)} = \sigma(u_o^\top v_c)$$

$$p_\theta(y = 0 \mid c, o) = 1 - \sigma(u_o^\top v_c) = \sigma(-u_o^\top v_c)$$

• Objective:

$$\sum_{t=1}^{T} \sum_{j} \Big( \log \sigma(u_{w_{t+j}}^\top v_{w_t}) + \sum_{k \sim p(w)} \log \sigma(-u_{w^{(k)}}^\top v_{w_t}) \Big)$$

• Negative samples $w^{(k)}$ are drawn from $p(w) \propto U(w)^{3/4}$, where U is the unigram distribution

Intro to ML prerequisite 40
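A minimal sketch of the per-pair SGNS objective above (positive term plus k negative terms); sampling the negatives from the smoothed unigram distribution is omitted here, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_c, u_pos, U_neg):
    """Per-pair SGNS objective (to maximize):
    log sigma(u_o . v_c) + sum_k log sigma(-u_neg_k . v_c)."""
    pos = np.log(sigmoid(u_pos @ v_c))
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()
    return pos + neg

rng = np.random.default_rng(0)
d, k = 4, 5
print(sgns_objective(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```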

Summary

• We defined the three necessary components.

• Model (binary classification)

• Objective (ML with negative sampling)

• Optimization method (SGD)

41

Many variants

• CBOW: predict center word from context

• Defining context:

• How big is the window?

• Is it sequential or based on syntactic information?

• Different model for every context position?

• Use stop words?

• …

42

Matrix factorization

43

Matrix factorization

• Consider the word-context co-occurrence matrix for a corpus:

“I like deep learning. I like NLP. I enjoy flying.”

            I  like  enjoy  deep  learning  NLP  flying  .
I           0    2     1     0       0       0     0     0
like        2    0     0     1       0       1     0     0
enjoy       1    0     0     0       0       0     1     0
deep        0    1     0     0       1       0     0     0
learning    0    0     0     1       0       0     0     1
NLP         0    1     0     0       0       0     0     1
flying      0    0     1     0       0       0     0     1
.           0    0     0     0       1       1     1     0

Landauer and Dumais (1997)

44 Intro to ML prerequisite
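The matrix above can be reproduced with a few lines of Python, assuming a window of one token that does not cross sentence boundaries (which is what matches the counts shown):

```python
import numpy as np

sentences = [s.split() for s in
             ["I like deep learning .", "I like NLP .", "I enjoy flying ."]]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric word-context counts, window of 1, within each sentence.
A = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for t in range(len(sent) - 1):
        i, j = idx[sent[t]], idx[sent[t + 1]]
        A[i, j] += 1
        A[j, i] += 1

print(vocab)
print(A)
```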

Matrix factorization

• Reconstruct the matrix from low-dimensional word-context representations.

• Minimizes:

$$\sum_{i,j} (A_{ij} - A^k_{ij})^2 = \|A - A^k\|^2$$

45
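A sketch of rank-k reconstruction with a truncated SVD; the random matrix A and the square-root scaling of the singular values for the word vectors are illustrative choices, not prescribed by the slides.

```python
import numpy as np

# A: any word-context co-occurrence matrix (rows = words, columns = contexts).
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 8)).astype(float)

k = 2
U, s, Vt = np.linalg.svd(A)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]        # best rank-k approximation of A
word_vecs = U[:, :k] * np.sqrt(s[:k])     # one common way to take k-dim word vectors

print(np.linalg.norm(A - A_k))            # reconstruction error ||A - A_k||
```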

Matrix factorization

46

Relation to skip-gram

• The output of skip-gram can be viewed as factorizing a word-context matrix:

$$M = V U^\top, \qquad M \in \mathbb{R}^{|V| \times |V|}, \quad V, U \in \mathbb{R}^{|V| \times d}$$

• Which matrix M is decomposed by skip-gram?

Levy and Goldberg, 2015

47

Relation to skip-gram

$$\#(c) = \sum_{o'} \#(c, o') \qquad \#(o) = \sum_{c'} \#(c', o) \qquad T = \sum_{(c,o)} \#(c, o)$$

$$\frac{\#(o)}{T}: \text{ unigram probability of } o \qquad P_T: \text{ unigram distribution}$$

$$P_T(w) = \frac{c(w)}{|D|} = \frac{c(w) \cdot m}{|D| \cdot m} = \frac{\#(o)}{T}$$

Relation to skip-gram

• Re-write the objective:

$$L(\theta) = \sum_{c,o} \#(c, o) \Big( \log(\sigma(u_o^\top v_c)) + k \cdot \mathbb{E}_{o' \sim P_T}[\log(\sigma(-u_{o'}^\top v_c))] \Big)$$

(distribute)

$$= \sum_{c,o} \#(c, o) \log(\sigma(u_o^\top v_c)) + \sum_{c,o} \#(c, o) \cdot k \cdot \mathbb{E}_{o' \sim P_T}[\log(\sigma(-u_{o'}^\top v_c))]$$

(the expectation is constant for o)

$$= \sum_{c,o} \#(c, o) \log(\sigma(u_o^\top v_c)) + \sum_{c} \#(c) \cdot k \cdot \mathbb{E}_{o' \sim P_T}[\log(\sigma(-u_{o'}^\top v_c))]$$

(open the expectation)

$$= \sum_{c,o} \#(c, o) \log(\sigma(u_o^\top v_c)) + \sum_{c} \#(c) \cdot k \cdot \sum_{o'} \frac{\#(o')}{T} \log(\sigma(-u_{o'}^\top v_c))$$

(gather terms)

$$= \sum_{c,o} \Big( \#(c, o) \log(\sigma(u_o^\top v_c)) + \#(c) \cdot k \cdot \frac{\#(o)}{T} \log(\sigma(-u_o^\top v_c)) \Big)$$

49

Relation to skip-gram

• Let’s assume the dot products are independent of one another. Let $x = u_o^\top v_c$:

$$\ell(x) = \#(c, o) \log(\sigma(x)) + \#(c) \cdot k \cdot \frac{\#(o)}{T} \log(\sigma(-x)), \qquad L(\theta) = \sum_{c,o} \ell(x)$$

$$\frac{\partial \ell(x)}{\partial x} = \#(c, o)\,\sigma(-x) - \#(c) \cdot k \cdot \frac{\#(o)}{T}\,\sigma(x) = 0$$

$$x = \log\left( \frac{\#(c, o) \cdot T}{\#(c) \cdot \#(o)} \cdot \frac{1}{k} \right)$$

$$x = \log\left( \frac{p(c, o)}{p(c) \cdot p(o)} \right) - \log k = \mathrm{PMI}(c, o) - \log k$$

50

Relation to skip-gram

• Conclusion: skip-gram with negative sampling implicitly factorizes a “shifted” PMI matrix

• Many NLP methods factorize the PMI matrix with matrix decomposition methods to obtain dense vectors.

51
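A sketch of how the shifted PMI matrix could be built from co-occurrence counts; `shifted_pmi` is an illustrative name, and in practice negative entries are often clipped to zero (PPMI) before factorizing.

```python
import numpy as np

def shifted_pmi(counts, k=5):
    """Shifted PMI matrix: PMI(c, o) - log k, from co-occurrence counts.

    counts[c, o] = #(c, o); rows are center words, columns are context words.
    """
    T = counts.sum()
    p_co = counts / T
    p_c = counts.sum(axis=1, keepdims=True) / T
    p_o = counts.sum(axis=0, keepdims=True) / T
    with np.errstate(divide="ignore"):
        pmi = np.log(p_co / (p_c * p_o))   # zero counts give -inf entries
    return pmi - np.log(k)

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(6, 6)).astype(float)
print(np.round(shifted_pmi(counts), 2))
```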

Evaluation

52

Evaluation

• Intrinsic vs. extrinsic evaluation:

• Intrinsic: define some artificial task that tries to directly measure the quality of your learning algorithm (a bit of that in home assignment 1).

• Extrinsic: check whether your output is useful in a real NLP task

53

Intrinsic evaluation

• Word analogies:

• Normalize all word vectors to unit length

• man::woman <—> king::??

• a::b <—> c::d

$$d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\|x_b - x_a + x_c\|}$$

54
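A sketch of the analogy evaluation above, assuming the embedding matrix is L2-normalized row-wise; excluding the query words from the argmax is common practice but an addition of mine, and the usage line is hypothetical.

```python
import numpy as np

def analogy(E, vocab, a, b, c):
    """Return d maximizing cosine(x_b - x_a + x_c, x_d), excluding a, b, c.

    E: (|V|, dim) matrix of word vectors, assumed L2-normalized row-wise.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    query = E[idx[b]] - E[idx[a]] + E[idx[c]]
    query /= np.linalg.norm(query)
    scores = E @ query
    for w in (a, b, c):                  # don't return one of the query words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Usage (hypothetical): analogy(E, vocab, "man", "woman", "king") -> "queen"
```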

Visualization

55

Visualization

56

Visualization

57

GloVe

• An objective that attempts to create a semantic space with linear structure

• Probability ratios are more important than probabilities

Pennington et al., 2014

GloVe

• Try to find word embeddings such that (roughly):

$$(v_{c_1} - v_{c_2})^\top u_o = \frac{P_{c_1 o}}{P_{c_2 o}}$$

where $P_{co}$ is the probability of an output word o given a center word c.

• As an example:

$$v_{\text{ice}} - v_{\text{steam}} \approx u_{\text{solid}}, \qquad v_{\text{steam}} - v_{\text{ice}} \approx u_{\text{gas}}$$

Pennington et al., 2014

Word analogies evaluation

60

Human correlation intrinsic evaluation

word 1      word 2      human judgement
tiger       cat         7.35
book        paper       7.46
computer    internet    7.58
plane       car         5.77
stock       phone       1.62
stock       CD          1.31
stock       jaguar      0.92

61

Human correlation intrinsic evaluation

• Compute the Spearman rank correlation between human similarity judgments and model similarity predictions (WordSim-353):

62
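A sketch of this WordSim-style evaluation, assuming SciPy is available for `spearmanr`; the toy pairs and random vectors below are placeholders for real data and embeddings.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_wordsim(pairs, human_scores, vectors):
    """Spearman correlation between human scores and cosine similarities.

    pairs: list of (word1, word2); vectors: dict word -> np.ndarray.
    """
    model_scores = []
    for w1, w2 in pairs:
        v1, v2 = vectors[w1], vectors[w2]
        model_scores.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return spearmanr(human_scores, model_scores).correlation

rng = np.random.default_rng(0)
pairs = [("tiger", "cat"), ("book", "paper"), ("stock", "jaguar")]
vectors = {w: rng.normal(size=50) for p in pairs for w in p}
print(evaluate_wordsim(pairs, [7.35, 7.46, 0.92], vectors))
```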

Extrinsic evaluation

• Task: named entity recognition. Find mentions of persons, locations, and organizations in text.

• Using good word representations might be useful

63

Extrinsic evaluation

64

Summary

• Words are central to language

• Most NLP systems use some form of word representation

• Graph-based representations are difficult to manipulate and compose

• One-hot vectors are useful with enough data, but lose all generalization information

• Word embeddings provide a compact way to encode word meaning and similarity (but what about inference relations?)

• Skip-gram with negative sampling is a popular approach for learning word embeddings by casting an unsupervised problem as a supervised problem

• It is strongly related to classical matrix decomposition methods.

65

Current Research

• Contextualized word representations

• Sentence representations

Assignment 1

• Implement skip-gram with negative sampling

• There is ample literature if you want to consider this for a project

67

Gradient checks

$$\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$

• This is the single-parameter case

• For parameter vectors, iterate over all parameters and compute the numerical gradient for each one

68
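A sketch of the centered-difference gradient check described above; the quadratic test function is only for illustration.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-6):
    """Centered-difference estimate of dJ/dtheta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Toy check: J(theta) = ||theta||^2 has gradient 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
print(np.allclose(numerical_gradient(lambda t: t @ t, theta), 2 * theta))  # True
```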
