
Deep Learning: A Statistical Perspective

Myunghee Cho Paik
Guest lectures by Gisoo Kim, Yongchan Kwon, Young-geun Kim, Wonyoung Kim and Youngwon Choi

Seoul National University

March-June, 2018

Introduction

Natural Language Processing

Natural Language Processing (NLP) includes:

Sentiment analysis
Machine translation
Text generation
...

How do we train models on language?

How can we convert language into numbers?


Word Embedding

How do we map words into $\mathbb{R}^d$?

One-hot encoding

Each vector has nothing to do with any other vector: $\forall u \neq v$, $u^\top v = 0$ and $\|u - v\| = \sqrt{2}$.

However...

Each word is characterized by the company it keeps: "Ice" is closer to "Solid" than to "Gas".


Main Questions in Word Embedding

Vocabulary set: $\mathcal{V} = \{\text{a}, \text{the}, \text{deep}, \text{statistics}, \ldots\}$

Size-$N$ corpus: $\mathcal{C} = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, with $v^{(1)}, \ldots, v^{(N)} \in \mathcal{V}$

Given the corpus data, how can we measure the similarity between words, $\mathrm{sim}(\text{deep}, \text{statistics})$?

How can we define $f$ and learn $w_{\text{deep}}, w_{\text{statistics}}$ such that $\mathrm{sim}(\text{deep}, \text{statistics}) = f(w_{\text{deep}}, w_{\text{statistics}})$?


Some Famous Word Embedding Techniques

Latent Semantic Analysis (LSA) (Deerwester et al. 1990)

Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996)

Word2Vec (Mikolov et al. 2013a)

GloVe (Pennington et al. 2014)


LSA (Deerwester et al. 1990)

Term-document matrix: $X_{t \times d}$

$\mathrm{sim}(a, b) \propto$ co-occurrence within each document.

Singular value decomposition: $X_{t \times d} = T S D^\top$

With the $k$ largest singular values:

$\hat{X}_{t \times d} = T_{t \times k} S_{k \times k} (D_{d \times k})^\top$

$T_{t \times k}$: $k$-dim term vectors, $D_{d \times k}$: $k$-dim document vectors

Figure: from (Deerwester et al. 1990)
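To make the rank-$k$ reconstruction concrete, here is a minimal numpy sketch of LSA-style truncated SVD; the tiny term-document matrix is made up purely for illustration.

```python
import numpy as np

# Toy term-document count matrix X (t terms x d documents); the counts are illustrative only.
X = np.array([[2., 0., 1., 0.],   # "deep"
              [1., 0., 2., 0.],   # "learning"
              [0., 3., 0., 1.],   # "movie"
              [0., 1., 0., 2.]])  # "review"

# Full SVD: X = T S D^T
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values.
k = 2
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

# Rank-k reconstruction: X_hat = T_k S_k D_k^T
X_hat = T_k @ S_k @ D_k.T

# k-dimensional term vectors; rows can be compared with cosine similarity.
term_vectors = T_k @ S_k
```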


HAL (Lund and Burgess, 1996)

Term-context term matrix: $X_{V \times V}$

How many times does the column word appear in front of the row term?

$\mathrm{sim}(a, b) \propto$ co-occurrence in nearby context.

Concatenate the row and column to make a $2V$-dim vector.

Dimension reduction with $k$ principal components.

Trained on 160M terms, with $V = 70{,}000$.

Figure: from (Lund and Burgess, 1996)
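As a rough illustration of how such a term-context matrix can be built, here is a minimal sketch that counts, for each row word, how often each column word appears in the preceding window. HAL's distance-based ramp weighting is omitted, so this is a simplifying assumption rather than the exact HAL scheme.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=5):
    """Count how often each column word appears within `window` positions
    *before* each row word (a simplified HAL-style matrix)."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for t, w in enumerate(tokens):
        if w not in idx:
            continue
        for s in range(max(0, t - window), t):
            c = tokens[s]
            if c in idx:
                X[idx[w], idx[c]] += 1
    return X

tokens = "the deep model reads the deep text".split()
vocab = ["the", "deep", "model", "reads", "text"]
X = cooccurrence_matrix(tokens, vocab)
# Concatenating row i and column i gives a 2V-dimensional vector for word i.
vec_deep = np.concatenate([X[vocab.index("deep")], X[:, vocab.index("deep")]])
```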


$\mathrm{sim}(\cdot, \cdot) \propto$ co-occurrence?

Co-occurrence with "and" or "the" does not imply semantic similarity.

Does a word merely appear frequently, or does it carry significant similarity?

Apply a transformation or define a new measure of similarity:

Entropy/correlation-based normalization (Rohde et al., 2006)

Positive pointwise mutual information (PPMI): $\max\{0, \log \frac{p(\text{context} \mid \text{term})}{p(\text{context})}\}$ (Bullinaria and Levy, 2007)

Square-root-type transformation (Lebret and Collobert, 2014)

Train $p(\text{context} \mid \text{term})$ within every local window (Word2Vec)
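A minimal sketch of the PPMI transform applied to a term-context count matrix, using plug-in probability estimates from the counts; the example matrix is illustrative only.

```python
import numpy as np

def ppmi(X, eps=1e-12):
    """Positive pointwise mutual information of a term-context count matrix:
    max{0, log p(context|term) / p(context)}."""
    total = X.sum()
    p_context = X.sum(axis=0) / total                             # p(context)
    p_ctx_given_term = X / (X.sum(axis=1, keepdims=True) + eps)   # p(context|term)
    pmi = np.log((p_ctx_given_term + eps) / (p_context + eps))
    return np.maximum(0.0, pmi)

X = np.array([[10., 2., 0.],
              [3., 0., 5.],
              [1., 1., 1.]])
print(ppmi(X))
```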

Word2Vec

Model Setup (Mikolov et al., 2013)

Vocabulary set: $\mathcal{V} = \{e_1, e_2, \ldots, e_V\} \subset \{0, 1\}^V$

Size-$N$ corpus: $\mathcal{C} = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, with $v^{(1)}, \ldots, v^{(N)} \in \mathcal{V}$

Embedded word vectors:

$$W_{d \times V} = \begin{bmatrix} | & | & & | \\ w_1 & w_2 & \cdots & w_V \\ | & | & & | \end{bmatrix}, \qquad W'_{d \times V} = \begin{bmatrix} | & | & & | \\ w'_1 & w'_2 & \cdots & w'_V \\ | & | & & | \end{bmatrix}$$


Model Setup (Mikolov et al., 2013)

Thus, the model becomes:

$$P(v^{(\text{output})} \mid v^{(\text{input})}) = \frac{\exp(w'_{\text{output}} \cdot w_{\text{input}})}{\sum_{j=1}^{V} \exp(w'_j \cdot w_{\text{input}})}$$

$W$ / $W'$ is called the input/output representation.

Note that $W \neq W'$. If $W = W'$, $P(\cdot \mid \cdot)$ would be maximized when the context word equals the input word, which is a rare event.

If the output (or context) word appears in the window, $w'_{\text{output}} \cdot w_{\text{input}}$ increases.
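A minimal numpy sketch of this softmax model: given the input representation $W$ and output representation $W'$ (columns are word vectors), compute $P(e_j \mid e_i)$ for every candidate output word. The sizes and random initial values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                           # small vocabulary and embedding dimension for illustration
W  = rng.uniform(-0.5, 0.5, (d, V))    # input representation  (columns w_1, ..., w_V)
Wp = rng.uniform(-0.5, 0.5, (d, V))    # output representation (columns w'_1, ..., w'_V)

def softmax(u):
    u = u - u.max()                    # numerical stability
    e = np.exp(u)
    return e / e.sum()

i = 3                                  # index of the input word e_i
u = Wp.T @ W[:, i]                     # u_{ij} = w'_j . w_i for all j
p = softmax(u)                         # p[j] = P(e_j | e_i)
```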


Training the Model (Rong, 2014)

Initialize $W$ → read a (context, input) pair → update $W'$ → update $W$ → read another (context, input) pair → $\cdots$

Initialization: $W_{ij} \sim U[-0.5, 0.5]$, $\forall i, j$

Suppose $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.

Update $W'$ by minimizing the negative log-likelihood:

$$L \equiv -\log P(e_o \mid e_i) = \log\Big(\sum_{j=1}^{V} \exp(u_{ij})\Big) - u_{io}$$

where $u_{ij} = w'_j \cdot w_i$, $j = 1, \ldots, V$.


Training the Model (Rong, 2014)

Taking derivatives:

$$\frac{\partial L}{\partial u_{ik}} = \frac{\exp(u_{ik})}{\sum_{j=1}^{V} \exp(u_{ij})} - \delta_{(k=o)}, \qquad \frac{\partial u_{ik}}{\partial w'_k} = w_i$$

$$\frac{\partial L}{\partial w'_k} = [P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$$

With gradient descent, the updating equation is:

$$w'_{k(\text{new})} = w'_{k(\text{old})} - \alpha\, [P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$$

If $k = o$, then $[P(e_k \mid e_i) - \delta_{(k=o)}] < 0$: the underestimating case, so the update adds a component in the $w_i$-direction to $w'_k$.

In summary, the updating equation increases $u_{io}$ and decreases $u_{ik}$, $\forall k \neq o$.
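A minimal sketch of this $W'$ update for a single (input, output) pair under the full softmax; the column layout follows the setup above, while the function name and learning rate default are illustrative choices.

```python
import numpy as np

def update_output_vectors(W, Wp, i, o, alpha=0.025):
    """One gradient step on the output representation W' (columns w'_k)
    for input word i and observed context word o, under the full softmax."""
    u = Wp.T @ W[:, i]                         # u_{ij} = w'_j . w_i
    p = np.exp(u - u.max()); p /= p.sum()      # P(e_j | e_i)
    err = p.copy()
    err[o] -= 1.0                              # P(e_k | e_i) - delta_(k=o)
    # w'_k <- w'_k - alpha * err_k * w_i, for all k
    return Wp - alpha * np.outer(W[:, i], err)
```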


Training the Model (Rong, 2014)

Given $W'$, update $W$.

Reminder: $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.

Taking derivatives w.r.t. $w_i$:

$$\frac{\partial L}{\partial u_{ik}} = \frac{\exp(u_{ik})}{\sum_{j=1}^{V} \exp(u_{ij})} - \delta_{(k=o)}, \qquad \frac{\partial u_{ik}}{\partial w_i} = w'_k$$

$$\frac{\partial L}{\partial w_i} = \sum_{j=1}^{V} \frac{\partial L}{\partial u_{ij}} \frac{\partial u_{ij}}{\partial w_i} = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$$

Define $EH = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$: the sum of output vectors, weighted by their prediction errors.

With gradient descent, the updating equation is:

$$w_{i(\text{new})} = w_{i(\text{old})} - \alpha\, EH$$
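And a matching sketch of the $W$ update: the input vector $w_i$ moves by $-\alpha \cdot EH$, the prediction-error-weighted sum of output vectors (same illustrative conventions as the previous sketch).

```python
import numpy as np

def update_input_vector(W, Wp, i, o, alpha=0.025):
    """One gradient step on the input vector w_i under the full softmax."""
    u = Wp.T @ W[:, i]
    p = np.exp(u - u.max()); p /= p.sum()      # P(e_j | e_i)
    err = p.copy()
    err[o] -= 1.0                              # P(e_j | e_i) - delta_(j=o)
    EH = Wp @ err                              # sum_j err_j * w'_j
    W_new = W.copy()
    W_new[:, i] -= alpha * EH
    return W_new
```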


CBOW and Skip-gram (Mikolov et al., 2013)

Figure: CBOW Model
Figure: Skip-gram Model


Training CBOW (Rong, 2014)

Input: $v^{(t-c)}, \cdots, v^{(t-1)}, v^{(t+1)}, \cdots, v^{(t+c)}$. Output: $v^{(t)}$.

Suppose $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$, and $v^{(t)} = e_o$.

Define:

$$h_t \equiv \frac{1}{2c} \sum_{j=-c,\, j \neq 0}^{c} W v^{(t+j)} = \frac{1}{2c} \sum_{k=1}^{2c} w_{t(k)}$$

Then the model becomes:

$$P(e_o \mid e_{t(1)}, \ldots, e_{t(2c)}) = \frac{\exp(w'_o \cdot h_t)}{\sum_{j=1}^{V} \exp(w'_j \cdot h_t)}$$

The loss is defined by the negative log-likelihood:

$$L \equiv \log \sum_{j=1}^{V} \exp(w'_j \cdot h_t) - w'_o \cdot h_t$$

where $u_{tj} = w'_j \cdot h_t$, $\forall j = 1, \ldots, V$.


Training CBOW (Rong, 2014)

With a similar calculation, the updating equation for $W'$ becomes:

$$w'_{k(\text{new})} = w'_{k(\text{old})} - \alpha\, [P(e_k \mid v^{(t-c)}, \cdots, v^{(t+c)}) - \delta_{(k=o)}]\, h_t$$

For $W$, note that $u_{tj} = w'_j \cdot h_t = \frac{1}{2c} \sum_{k=1}^{2c} w'_j \cdot w_{t(k)}$.

For back-propagation:

$$\frac{\partial L}{\partial w_{t(k)}} = \sum_{j=1}^{V} \frac{\partial L}{\partial u_{tj}} \frac{\partial u_{tj}}{\partial w_{t(k)}} = \frac{1}{2c} \sum_{j=1}^{V} [P(e_j \mid v^{(t-c)}, \cdots, v^{(t+c)}) - \delta_{(j=o)}]\, w'_j = \frac{1}{2c} EH$$

Thus the updating equation becomes:

$$w_{t(k)(\text{new})} = w_{t(k)(\text{old})} - \alpha\, \frac{1}{2c} EH, \quad \forall k = 1, \ldots, 2c$$
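A minimal sketch of one CBOW step with the full softmax, combining both updates above: average the $2c$ context vectors into $h_t$, update $W'$ with the prediction errors, and spread $EH/(2c)$ back to each context vector. Function name, learning rate, and in-place updates are illustrative choices.

```python
import numpy as np

def cbow_step(W, Wp, context_ids, target_id, alpha=0.025):
    """One CBOW step: context_ids = [t(1), ..., t(2c)], target_id = o."""
    h = W[:, context_ids].mean(axis=1)          # h_t = (1/2c) sum_k w_{t(k)}
    u = Wp.T @ h                                # u_{tj} = w'_j . h_t
    p = np.exp(u - u.max()); p /= p.sum()       # P(e_j | context)
    err = p.copy(); err[target_id] -= 1.0       # prediction errors
    EH = Wp @ err                               # error-weighted sum of output vectors
    Wp -= alpha * np.outer(h, err)              # update W'
    # Update each context vector by alpha * EH / (2c) (ignores repeated context indices).
    W[:, context_ids] -= alpha * EH[:, None] / len(context_ids)
    return W, Wp
```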


Training Skip-gram (Rong, 2014)

Input: $v^{(t)}$. Output: $v^{(t-c)}, \cdots, v^{(t-1)}, v^{(t+1)}, \cdots, v^{(t+c)}$.

Suppose $v^{(t)} = e_i$, and $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$. Then the model becomes:

$$P(v^{(t-c)}, \cdots, v^{(t-1)}, v^{(t+1)}, \cdots, v^{(t+c)} \mid v^{(t)}) = \prod_{k=1}^{2c} P(e_{t(k)} \mid e_i) = \prod_{k=1}^{2c} \frac{\exp(w'_{t(k)} \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$$

The loss becomes:

$$L \equiv \sum_{k=1}^{2c} L_k = \sum_{k=1}^{2c} \Big[\log \sum_{j=1}^{V} \exp(u^{(k)}_{ij}) - u^{(k)}_{i\,t(k)}\Big]$$

where $u^{(k)}_{ij} = w'_j \cdot w_i$ is the score entering only the $k$-th loss.


Training Skip-gram (Rong, 2014)

For $W'$, $j = 1, \ldots, V$:

$$\frac{\partial L}{\partial w'_j} = \sum_{k=1}^{2c} \frac{\partial L_k}{\partial u^{(k)}_{ij}} \frac{\partial u^{(k)}_{ij}}{\partial w'_j} = \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i$$

Thus, the updating equation for $W'$ becomes:

$$w'_{j(\text{new})} = w'_{j(\text{old})} - \alpha \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i, \quad \forall j = 1, \ldots, V$$


Training Skip-gram (Rong, 2014)

For $W$:

$$\frac{\partial L}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} \frac{\partial L_k}{\partial u^{(k)}_{ij}} \frac{\partial u^{(k)}_{ij}}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w'_j \equiv \sum_{k=1}^{2c} EH^{(k)}$$

Thus, the updating equation for $W$ becomes:

$$w_{i(\text{new})} = w_{i(\text{old})} - \alpha \sum_{k=1}^{2c} EH^{(k)}$$
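A minimal sketch of one skip-gram step with the full softmax: because $u^{(k)}_{ij}$ does not depend on $k$, the $2c$ per-position errors can be summed before updating. Same illustrative conventions as the earlier sketches.

```python
import numpy as np

def skipgram_step(W, Wp, input_id, context_ids, alpha=0.025):
    """One skip-gram step: input_id = i, context_ids = [t(1), ..., t(2c)]."""
    w_i = W[:, input_id].copy()
    u = Wp.T @ w_i                              # u_{ij} = w'_j . w_i (same for every k)
    p = np.exp(u - u.max()); p /= p.sum()       # P(e_j | e_i)
    err = len(context_ids) * p                  # sum over k of P(e_j | e_i)
    for t in context_ids:
        err[t] -= 1.0                           # minus delta_(j=t(k)) for each position k
    EH_total = Wp @ err                         # sum_k EH^(k)
    Wp -= alpha * np.outer(w_i, err)            # update W'
    W[:, input_id] = w_i - alpha * EH_total     # update w_i
    return W, Wp
```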


Computational Problem

For each (input, output) pair in the corpus $\mathcal{C} = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, the model must calculate:

$$P(e_o \mid e_i) = \frac{\exp(w'_o \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$$

Each epoch requires roughly $N \times V$ inner products of $d$-dim vectors (skip-gram: $N \times V \times 2c$).

The computation is proportional to $V \approx 10^6$.

(Mikolov et al., 2013) suggests two alternative formulations: hierarchical softmax and negative sampling.


Hierarchical softmax (Mikolov et al., 2013)

An efficient way of computing the softmax.

Build a Huffman binary tree using word frequencies.

Instead of $w'_j$, the model uses $w'_{n(e_j, l)}$.

$n(e_j, l)$: the $l$-th node on the path from the root to the word $e_j$.

Figure: Binary tree for HS

Let $h_i$ be the hidden vector. Then the probability model becomes:

$$P(e_o \mid e_i) = \prod_{l=1}^{L(e_o)-1} \sigma\big([n(e_o, l+1) \text{ is the left child of } n(e_o, l)] \times w'_{n(e_o, l)} \cdot h_i\big)$$

where $[\cdot]$ is $+1$ if the statement is true and $-1$ otherwise, and $L(e_o)$ is the length of the path to $e_o$.


Training Hierarchical Softmax (Rong, 2014)

Let $L = -\log P(e_o \mid e_i)$, and write $w'_{n(e_o, l)} = w'_l$. Then:

$$\frac{\partial L}{\partial (w'_l \cdot h_i)} = \{\sigma([\cdots]\, w'_l \cdot h_i) - 1\}\,[\cdots] = \begin{cases} \sigma(w'_l \cdot h_i) - 1 & [\cdots] = 1 \\ \sigma(w'_l \cdot h_i) & [\cdots] = -1 \end{cases}$$

$\sigma(w'_l \cdot h_i)$ is the probability that $w'_{l+1}$ is the left child node of $w'_l$. Thus,

$$\frac{\partial L}{\partial (w'_l \cdot h_i)} = P[w'_{l+1} \text{ is the left child node of } w'_l] - \delta_{[\cdots]}$$

Thus the updating equation becomes: for $l = 1, \ldots, L(e_o) - 1$,

$$w'_{l(\text{new})} = w'_{l(\text{old})} - \alpha\big(P[w'_{l+1} \text{ is the left child node of } w'_l] - \delta_{[\cdots]}\big)\, h_i$$

For the skip-gram model, repeat this procedure for the $2c$ outputs.

The updating equation for $W$ becomes:

$$w_{i(\text{new})} = w_{i(\text{old})} - \alpha\, EH\, \frac{\partial h_i}{\partial w_i}, \qquad \text{where } EH = \sum_{l=1}^{L(e_o)-1} \big(P[\cdots] - \delta_{[\cdots]}\big)\, w'_l$$
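A minimal sketch of the hierarchical-softmax probability itself: a word is identified by its path of left/right decisions through the inner nodes, and $P(e_o \mid h_i)$ is the product of sigmoids along that path. The toy path encoding below is an assumption for illustration, not a real Huffman tree built from word frequencies.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(h, path_nodes, path_signs, Wp_nodes):
    """P(e_o | h) = prod_l sigma(sign_l * w'_{n(e_o,l)} . h), where sign_l is +1
    when the path continues to the left child and -1 otherwise."""
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * Wp_nodes[:, node] @ h)
    return p

rng = np.random.default_rng(0)
d, n_inner = 4, 7                                  # toy dimensions
Wp_nodes = rng.uniform(-0.5, 0.5, (d, n_inner))    # one vector per inner node
h = rng.uniform(-0.5, 0.5, d)
# Toy path for some word: root -> left -> right -> left
print(hs_probability(h, path_nodes=[0, 1, 4], path_signs=[+1, -1, +1], Wp_nodes=Wp_nodes))
```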


Negative Sampling (Mikolov et al., 2013)

Generate $e_{n(1)}, \ldots, e_{n(k)}$ from the noise distribution $P_n$.

The goal is to discriminate $(h_i, e_o)$ from $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$.

For the skip-gram model, repeat this procedure for each of the $2c$ outputs.

$k = 5$-$20$ is useful; for large datasets, $k$ can be as small as $2$-$5$.

The noise distribution $P_n(e_n) \propto \big[\#(e_n)/N\big]^{3/4}$ significantly outperformed alternatives.

Figure: 5-Negative Sampling


Objective in Negative Sampling (Goldberg and Levy, 2014)

Suppose $(h_i, e_o)$ and $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$, with $o \neq n(j)$ $\forall j = 1, \ldots, k$, are given.

Let $[D = 1 \mid h_i, e_j]$ be the event that the pair $(h_i, e_j)$ came from the original corpus.

The model assumes $P(D = 1 \mid h_i, e_j) = \sigma(w'_j \cdot h_i)$. Thus the likelihood becomes:

$$\sigma(w'_o \cdot h_i) \times \prod_{j=1}^{k} \big[1 - \sigma(w'_{n(j)} \cdot h_i)\big]$$

Taking the log leads to the objective in (Mikolov et al., 2013):

$$\log \sigma(w'_o \cdot h_i) + \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i), \qquad e_{n(j)} \sim P_n$$

Note that training $h_i$ given $w'_o, w'_{n(1)}, \ldots, w'_{n(k)}$ is a logistic regression.


Training Negative Sampling (Rong, 2014)

Define the loss as:

$$L = -\log \sigma(w'_o \cdot h_i) - \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i)$$

Let $\mathcal{W}_{\text{neg}} = \{w'_{n(1)}, \ldots, w'_{n(k)}\}$. Then the derivative:

$$\frac{\partial L}{\partial (w'_j \cdot h_i)} = \begin{cases} \sigma(w'_j \cdot h_i) - 1 & w'_j = w'_o \\ \sigma(w'_j \cdot h_i) & w'_j \in \mathcal{W}_{\text{neg}} \end{cases} = P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}$$

Thus the updating equation for $W'$: for $j = o, n(1), \ldots, n(k)$,

$$w'_{j(\text{new})} = w'_{j(\text{old})} - \alpha\, [P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}]\, h_i$$

Let $\frac{\partial L}{\partial h_i} = \sum_{j \in \{o, n(1), \ldots, n(k)\}} \big(P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}\big)\, w'_j \equiv EH$. Then the updating equation for $W$:

$$w_{i(\text{new})} = w_{i(\text{old})} - \alpha\, EH\, \frac{\partial h_i}{\partial w_i}$$
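A minimal sketch of one negative-sampling update in the skip-gram setting (so $h_i = w_i$ and $\partial h_i / \partial w_i$ is the identity): draw $k$ noise words from the unigram$^{3/4}$ distribution and apply the logistic-regression-style gradients above. Function names and defaults are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(W, Wp, input_id, target_id, noise_probs, k=5, alpha=0.025, rng=None):
    """One skip-gram negative-sampling step for a single (input, target) pair."""
    rng = rng or np.random.default_rng()
    h = W[:, input_id].copy()                            # for skip-gram, h_i = w_i
    # A real implementation would resample negatives that collide with the target.
    negatives = rng.choice(len(noise_probs), size=k, p=noise_probs)
    EH = np.zeros_like(h)
    for j, label in [(target_id, 1.0)] + [(n, 0.0) for n in negatives]:
        err = sigmoid(Wp[:, j] @ h) - label              # P(D=1 | h, e_j) - delta_(j=o)
        EH += err * Wp[:, j]
        Wp[:, j] -= alpha * err * h                      # update output vector w'_j
    W[:, input_id] -= alpha * EH                         # update input vector w_i
    return W, Wp

# Noise distribution proportional to unigram count^(3/4).
counts = np.array([50., 10., 5., 2., 2., 1.])
noise_probs = counts ** 0.75
noise_probs /= noise_probs.sum()
```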


Two Pre-processing Techniques (Mikolov et al., 2013)

Frequent words (such as "a", "the", "in") provide less information value than rare words.

Let $\mathcal{V} = \{v_1, \ldots, v_V\}$ be the vocabulary set. Discard each occurrence of word $v_i$ with probability:

$$P(v_i) = 1 - \sqrt{\frac{t}{\#(v_i)/N}}$$

where $t = 10^{-5}$ is a suitable threshold value.

"New York Times" or "Toronto Maple Leafs" can be treated as a single word.

In order to find such phrases, define a score:

$$\mathrm{score}(v_i, v_j) = \frac{\#(v_i v_j) - \delta}{\#(v_i)\, \#(v_j)}$$

Over 2-4 passes of the training set, calculate the score with a decreasing discounting coefficient $\delta$. When the score is above some threshold, treat $v_i v_j$ as a single word.
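A minimal sketch of both pre-processing steps: subsampling each occurrence of a frequent word with probability $1 - \sqrt{t / (\#(v)/N)}$, and scoring adjacent word pairs with a discounting coefficient $\delta$. The threshold values are the ones quoted above; everything else is an illustrative choice.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def subsample(tokens, t=1e-5):
    """Randomly discard each occurrence of word v with probability 1 - sqrt(t / (#(v)/N))."""
    counts = Counter(tokens)
    N = len(tokens)
    kept = []
    for w in tokens:
        p_discard = max(0.0, 1.0 - np.sqrt(t / (counts[w] / N)))
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

def phrase_score(tokens, delta=5.0):
    """score(v_i, v_j) = (#(v_i v_j) - delta) / (#(v_i) * #(v_j)) for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {pair: (c - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
            for pair, c in bigrams.items()}
```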

GloVe

Motivation (Pennington et al., 2014)

Let $\mathcal{V} = \{v_1, \ldots, v_V\}$ be the vocabulary set.

Throughout the corpus $\mathcal{C}$, define some statistics:

$X_{ij}$: the number of times word $v_j$ appears in the context of word $v_i$

$X_i \equiv \sum_k X_{ik}$: the number of times any word appears in the context of $v_i$

$P_{ij} = X_{ij}/X_i$: the probability that $v_j$ appears in the context of $v_i$

How can we measure the similarity between words, $\mathrm{sim}(v_i, v_j)$?


Motivation (Pennington et al., 2014)

Co-occurrence probabilities for "ice" and "steam" with selected context words from a corpus ($N$ = 6 billion).

If $v_k$ is related to $v_i$ rather than $v_j$, then $P_{ik}/P_{jk}$ will be larger than 1.

If $v_k$ is related (or unrelated) to both $v_i$ and $v_j$, then $P_{ik}/P_{jk}$ will be close to 1.

The ratio $P_{ik}/P_{jk}$ is useful for determining whether $v_k$ is close to $v_i$ (or to $v_j$).

Figure: from (Pennington et al., 2014)


Model Setup (Pennington et al., 2014)

With this motivation, the model becomes:

$$\frac{P_{ik}}{P_{jk}} = F(w_i, w_j, w'_k), \qquad w_i, w_j, w'_k \in \mathbb{R}^d$$

Keeping two sets of parameters $W$, $W'$ can help reduce overfitting and noise and generally improves results (Ciresan et al., 2012).

In the vector space, knowing $w_1, \ldots, w_V$ is the same as knowing $w_1 - w_i, \ldots, w_V - w_i$. Thus $F$ can be restricted to:

$$\frac{P_{ik}}{P_{jk}} = F(w_i - w_j, w'_k)$$

In order to match dimensions and preserve the linear structure, use dot products:

$$\frac{P_{ik}}{P_{jk}} = F\big[(w_i - w_j) \cdot w'_k\big]$$


Model Setup (Pennington et al., 2014)

For any $i, j, k, l = 1, \ldots, V$,

$$\frac{F\big[(w_i - w_j) \cdot w'_k\big]}{F\big[(w_j - w_l) \cdot w'_k\big]} = \frac{P_{ik}}{P_{lk}} = F\big[(w_i - w_l) \cdot w'_k\big]$$

It is natural to define $F$ satisfying $F(x)F(y) = F(x + y)$. This implies $F = \exp(\cdot)$.

Moreover:

$$F\big[(w_i - w_j) \cdot w'_k\big] = \frac{\exp(w_i \cdot w'_k)}{\exp(w_j \cdot w'_k)} = \frac{P_{ik}}{P_{jk}}$$

Thus,

$$w_i \cdot w'_k = \log P_{ik} = \log X_{ik} - \log X_i$$

Since the roles of a word and a context word are exchangeable, we also want $w_i \cdot w'_k = w_k \cdot w'_i$.


Model Setup (Pennington et al., 2014)

Consider $\log X_i$ as a bias of the input representation, $b_i$, and add another bias $b'_k$.

Finally, the model becomes:

$$w_i \cdot w'_k + b_i + b'_k = \log X_{ik}$$

Now, define a weighted cost function:

$$L = \sum_{i,j=1}^{V} f(X_{ij})\, (w_i \cdot w'_j + b_i + b'_j - \log X_{ij})^2$$

The weight must satisfy:

$f(0) = 0$: to handle the case $X_{ij} = 0$.

$f$ must be non-decreasing: frequent co-occurrences should be emphasized.

$f$ should be relatively small for large values: the case of "in", "the", "and".


Training GloVe (Pennington et al., 2014)

$f$ is suggested as:

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & x \geq x_{\max} \end{cases}$$

$x_{\max}$ is reported to have a weak impact on performance (fixed at $x_{\max} = 100$).

$\alpha = 3/4$ gives a modest improvement over $\alpha = 1$.

Training uses AdaGrad (Duchi et al., 2011), stochastically sampling the non-zero elements of $X$.

The model generates $W$ and $W'$; the final word vectors are $W + W'$.
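A minimal sketch of the GloVe objective: the weighting function $f$ and per-entry gradient updates of the weighted least-squares loss over the non-zero entries of $X$. Plain per-entry gradient descent stands in for the AdaGrad optimizer used in the paper; the toy count matrix and learning rate are illustrative assumptions.

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_epoch(X, W, Wp, b, bp, lr=0.05):
    """One pass of per-entry gradient descent over the non-zero entries of X."""
    total_loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        resid = W[:, i] @ Wp[:, j] + b[i] + bp[j] - np.log(X[i, j])
        weight = f_weight(X[i, j])
        total_loss += weight * resid ** 2
        g = 2.0 * weight * resid                     # common gradient factor
        gW_i, gWp_j = g * Wp[:, j], g * W[:, i]      # gradients before updating
        W[:, i] -= lr * gW_i
        Wp[:, j] -= lr * gWp_j
        b[i] -= lr * g
        bp[j] -= lr * g
    return total_loss

rng = np.random.default_rng(0)
V, d = 6, 4
X = rng.poisson(2.0, (V, V)).astype(float)           # toy co-occurrence counts
W, Wp = rng.normal(0, 0.1, (d, V)), rng.normal(0, 0.1, (d, V))
b, bp = np.zeros(V), np.zeros(V)
loss = glove_epoch(X, W, Wp, b, bp)
word_vectors = W + Wp                                 # the model concludes with W + W'
```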

Toy Implementation

Data and Model Descriptions

Movie review data from the NLTK corpus.

Consists of plot summaries and critiques.

Corpus size $N = 1.5$ million, vocabulary size $V = 39{,}768$.

Embedding dimension $d = 100$, window size $c = 5$.

Negative sample size: $k = 5$.

GloVe trained with 10 epochs.

Time elapsed for training (Intel Core i7 CPU @ 3.60GHz):

Model CBOW+HS CBOW+NEG SG+HS SG+NEG GloVe

Time 9.14s 4.53s 12.4s 12.3s 44.2s


Results

Similarity between two vectors


Results

Similarity between two vectors (most frequent words)


Results

Top 5 similar words with “villian”


Results

Linear relationship: ("actor" + "she" - "actress" = ?)


Results

Linear relationship: ("king" + "she" - "he" = ?)
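A minimal sketch of how these cosine-similarity and analogy queries can be computed from any trained embedding matrix; the vocabulary and random vectors below are placeholders standing in for the trained CBOW/skip-gram/GloVe vectors.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(query_vec, W, vocab, topn=5, exclude=()):
    """Rank vocabulary words by cosine similarity to query_vec."""
    sims = [(w, cosine(query_vec, W[:, j])) for j, w in enumerate(vocab) if w not in exclude]
    return sorted(sims, key=lambda t: -t[1])[:topn]

# Placeholder embeddings; in practice W comes from CBOW/skip-gram/GloVe training.
rng = np.random.default_rng(0)
vocab = ["king", "queen", "he", "she", "actor", "actress"]
W = rng.normal(size=(100, len(vocab)))
idx = {w: j for j, w in enumerate(vocab)}

# Analogy query: "king" + "she" - "he" = ?
q = W[:, idx["king"]] + W[:, idx["she"]] - W[:, idx["he"]]
print(most_similar(q, W, vocab, exclude={"king", "she", "he"}))
```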

Performances

Intrinsic Performances (Pennington et al., 2014)

Word analogies task: 19,544 questions

Semantic: "Athens" is to "Greece" as "Berlin" is to ( ? )

Syntactic: "dance" is to "dancing" as "fly" is to ( ? )

Corpus: Gigaword5 + Wikipedia2014

Percentage of correct answers:

Model d N Sem. Syn. Tot.

CBOW 300 6B 63.6 67.4 65.7

SG 300 6B 73.0 66.0 69.1

GloVe 300 6B 77.4 67.0 71.7

Table: From (Pennington et al., 2014)


Extrinsic Performances (Pennington et al., 2014)

Named entity recognition (NER) with a Conditional Random Field (CRF) model.

Input: Jim bought 300 shares of Acme Corp. in 2006

Output: [Jim](person) bought 300 shares of [Acme Corp.](Organization) in 2006

4 Entities: person, location, organization, miscellaneous.


Extrinsic Performances (Pennington et al., 2014)

Trained with CoNLL-03 training set and 50-dimensional word vectors.

F1 score on validation set and 3 kinds of test sets:

Model Validation CoNLL-Test ACE MUC7

Discrete 91.0 85.4 77.4 73.4

CBOW 93.1 88.2 82.2 81.1

SG None None None None

GloVe 93.2 88.3 82.9 82.2

Table: From (Pennington et al., 2014)

Word Embedding + RNN

How to Add Embedded Vectors to RNN

Recall RNN model:

Input: $x_t$

Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i x_t)$

Output unit: $o_t = c + U_o h_t$

Predicted probability: $p_t = \mathrm{softmax}(o_t)$

Unknown parameters: $(U_i, U_o, U_h, b, c)$


How to Add Embedded Vectors to RNN

With word embeddings:

Input: $w_{i(t)} = W x_t$

Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i w_{i(t)})$

Output unit: $o_t = c + U_o h_t$

Predicted probability: $p_t = \mathrm{softmax}(o_t)$

Unknown parameters: $(W, U_i, U_o, U_h, b, c)$

$W$ is not just an input; it serves as the initial weights of the word vectors.

The word vectors are thus fine-tuned for the specific task.

Another derivative is added: for $k = 1, \ldots, V$,

$$\frac{\partial L}{\partial w_k} = \sum_{t:\, i(t) = k} \frac{\partial L}{\partial o_t} \frac{\partial o_t}{\partial h_t} \frac{\partial h_t}{\partial w_k}$$

Can be generalized to LSTM and GRU.
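A minimal numpy sketch of this forward pass: the embedding matrix $W$, which would be initialized from pretrained word2vec/GloVe vectors, acts as the first layer and is fine-tuned together with the RNN parameters. Sizes and random initial values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 50, 16, 32                     # vocabulary, embedding, hidden sizes (illustrative)

# Parameters; in practice W is initialized from pretrained GloVe/word2vec vectors.
W  = rng.normal(0, 0.1, (d, V))          # word embeddings (fine-tuned during training)
Ui = rng.normal(0, 0.1, (H, d))
Uh = rng.normal(0, 0.1, (H, H))
Uo = rng.normal(0, 0.1, (V, H))
b, c = np.zeros(H), np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def forward(token_ids):
    """Run the RNN over a sequence of word indices; return the predicted distributions p_t."""
    h = np.zeros(H)
    probs = []
    for t in token_ids:
        x = W[:, t]                      # w_{i(t)} = W x_t, with x_t one-hot
        h = np.tanh(b + Uh @ h + Ui @ x)
        o = c + Uo @ h
        probs.append(softmax(o))
    return probs

probs = forward([3, 17, 8, 42])
```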


Word-rnn (Eidnes, 2015)

Goal: Generating clickbait headlines

Trained on 2M clickbait headlines scraped from BuzzFeed, Gawker, Jezebel, Huffington Post and Upworthy.

RNN model using GloVe word vectors ($N$ = 6B, $d$ = 200) as initial weights.

3-layer LSTM model with $T = 1200$.


Word-rnn (Eidnes, 2015)

The first 8 completions of "Barack Obama Says":

Barack Obama Says It's Wrong To Talk About Iraq
Barack Obama Says He's Like 'A Single Mother' And 'Over The Top'
Barack Obama Says He Did 48 Things Over
Barack Obama Says About Ohio Law
Barack Obama Says He Is Wrong
Barack Obama Says He Will Get The American Idol
Barack Obama Says Himself Are "Doing Well Around The World"
Barack Obama Says As He Leaves Politics With His Wife

More examples can be found on the website listed in the references.

Most of the generated sentences are grammatically correct and make sense.


Word-rnn (Eidnes, 2015)

The model seems to understand gender and political context:
"Mary J. Williams On Coming Out As A Woman"
"Romney Camp: 'I Think You Are A Bad President'"

Updating $W$ for only 2 layers works best.

Figure: From (Eidnes, 2015)

Conclusion

Summary

Embedding discrete words into $\mathbb{R}^d$ yields interesting results.

Similar words have vectors with high cosine similarity.

Linear relationships: "king" + "she" - "he" = ?

Embedded vectors can be used as inputs or initial weights of a deep neural network.

References

Key References

Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).


Key References

Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Eidnes, L. (2015). Auto-Generating Clickbait With Recurrent Neural Networks. [online] Lars Eidnes' blog. Available at: https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/ [Accessed 8 May 2018].
