Language Modeling + Feed-Forward Networks 3 · 2017-02-10
Language Modeling
+
Feed-Forward Networks 3
CS 287
Review: LM ML Setup
Multi-class prediction problem,
$(x_1, y_1), \ldots, (x_n, y_n)$
- $y_i$: the one-hot next word
- $x_i$: representation of the prefix $(w_1, \ldots, w_{t-1})$
Challenges:
- How do you represent input?
- Smoothing is crucially important.
- Output space is very large (next class)
Review: Perplexity
Previously, used accuracy as a metric.
Language modeling uses a version of average negative log-likelihood
- For test data $w_1, \ldots, w_n$
- $$\mathrm{NLL} = -\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})$$
Actually report perplexity,
$$\mathrm{perp} = \exp\left(-\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})\right)$$
Requires modeling full distribution as opposed to argmax (hinge-loss)
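As a minimal sketch of the two formulas above, the following computes NLL and perplexity from per-token probabilities. The function name `nll_and_perplexity` and the probability values are assumptions made up for illustration; in practice the probabilities would come from a trained model.

```python
import math

def nll_and_perplexity(token_probs):
    """Average negative log-likelihood and perplexity of a test sequence.

    token_probs[i] is the model probability p(w_i | w_1, ..., w_{i-1})
    assigned to the token that actually occurred at position i.
    """
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return nll, math.exp(nll)

# Hypothetical probabilities: a model that is uniform over 4 choices
# at every step has perplexity 4.
uniform_probs = [0.25, 0.25, 0.25, 0.25]
nll, perp = nll_and_perplexity(uniform_probs)
print(nll, perp)
```

Perplexity can be read as an effective branching factor: the uniform-over-4 model above is exactly as uncertain as a fair 4-way choice at each position.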
Review: Interpolation (Jelinek-Mercer Smoothing)
Can write recursively,
$$p_{\mathrm{interp}}(w \mid c) = \lambda\, p_{\mathrm{ML}}(w \mid c) + (1 - \lambda)\, p_{\mathrm{interp}}(w \mid c')$$
Ensure that $\lambda$ forms a convex combination,
$$0 \le \lambda \le 1$$
How do you learn conjunction combinations?
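The recursion can be unrolled from the shortest context upward. The sketch below assumes the recursion bottoms out at the shortest-context ML estimate (the slide does not state the base case), and the function name `interpolate` and the numbers are hypothetical.

```python
def interpolate(p_ml_by_order, lambdas):
    """Jelinek-Mercer interpolation for one word.

    p_ml_by_order: ML estimates p_ML(w | c), longest context first,
                   shortest context (e.g. unigram) last.
    lambdas:       one weight per level except the last, same order.
    The base case (assumed here) is the shortest-context ML estimate.
    """
    if len(lambdas) != len(p_ml_by_order) - 1:
        raise ValueError("need one lambda per non-final level")
    p = p_ml_by_order[-1]
    # Walk from the second-shortest context up to the longest one.
    for p_ml, lam in zip(reversed(p_ml_by_order[:-1]), reversed(lambdas)):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lambda must lie in [0, 1]")
        p = lam * p_ml + (1.0 - lam) * p
    return p

# Hypothetical trigram, bigram, unigram estimates for a single word.
p = interpolate([0.0, 0.25, 0.1], [0.5, 0.5])
print(p)
```

Even though the trigram estimate is zero here, the interpolated probability stays positive because mass flows in from the shorter contexts.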
Quiz
Assume we have seen the following training sentences,
- a tractor drove slow
- the red tractor drove fast
- the parrot flew fast
- the parrot flew slow
- the tractor slowed down
Compute $p_{\mathrm{ML}}$ for bigrams and use them to estimate whether parrot or
tractor fits better in each of the following contexts.
1. the red ?
2. the ?
3. the ? drove
Answer I
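| Bigram | $p_{\mathrm{ML}}$(second word given first) |
| --- | --- |
| a tractor | 1 |
| the red | 1/4 |
| the parrot | 1/2 |
| the tractor | 1/4 |
| red tractor | 1 |
| tractor drove | 2/3 |
| tractor slowed | 1/3 |
| parrot flew | 1 |
| . . . | |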
Answer II
- the red tractor
- the parrot
- the tractor drove
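The quiz answers can be checked with a short count-based sketch. It assumes no sentence-boundary tokens, and the helper names `p_ml` and `best_fit` are made up for illustration.

```python
from collections import Counter

sentences = [
    "a tractor drove slow",
    "the red tractor drove fast",
    "the parrot flew fast",
    "the parrot flew slow",
    "the tractor slowed down",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

def p_ml(word, prev):
    """Maximum-likelihood bigram estimate count(prev, word) / count(prev, *)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def best_fit(prev, nxt=None, candidates=("parrot", "tractor")):
    """Pick the candidate w maximizing p(w | prev), times p(nxt | w) if given."""
    def score(w):
        s = p_ml(w, prev)
        if nxt is not None:
            s *= p_ml(nxt, w)
        return s
    return max(candidates, key=score)

print(best_fit("red"))           # context "the red ?"
print(best_fit("the"))           # context "the ?"
print(best_fit("the", "drove"))  # context "the ? drove"
```

Note that the first context is scored with the bigram `red ?` only, since a bigram model ignores the earlier word `the`.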
Today’s Class
$$p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}; \theta)$$
- Estimate this directly as a neural network.
- Two types of models, neural network and log-bilinear.
- Efficient methods for approximate estimation.
Intuition: NGram Issues
In training we might see,
the arizona corporations commission authorized
But at test we see,
the colorado businesses organization
- Does this training example help here?
  - Not really. No count overlap.
- Does backoff help here?
  - Maybe, if we have seen organization.
  - Mostly get nothing from the earlier words.
Goal
- Learn representations that share properties between similar words.
- Particularly helpful for unseen contexts.
- Not a silver bullet, e.g. proper nouns
the eagles play the arizona diamondbacks
Whereas at test we might see,
the eagles play the colorado
(We will discuss this issue more in MT)
Baseline: Class-Based Language Models
- Groups words into classes based on word-context.
  Class 5: motorcycle, truck, car, . . .
  Class 3: horse, cat, dog, . . .
  . . .
- Various factorization methods for estimating with count-based approaches.
- However, assumes a hard-clustering, often estimated separately.
Contents
Neural Language Models
Noise Contrastive Estimation
Recall: Word Embeddings
- Embeddings give multi-dimensional representation of words.
- Ex: Closest to arizona by cosine similarity
  texas 0.932968706025
  florida 0.932696958878
  kansas 0.914805968271
  colorado 0.904197441085
  minnesota 0.863925347525
  carolina 0.862697751337
  utah 0.861915722889
  miami 0.842350326527
  oregon 0.842065064748
- Gives a multi-clustering over words.
Feed-Forward Neural NNLM (Bengio, 2003)
- $w_{i-n+1}, \ldots, w_{i-1}$ are input embedding representations
- $w_i$ is an output embedding representation
- Model simultaneously learns,
  - input word representations
  - output word representations
  - conjunctions of input words (through NLM, no n-gram features)
Feed-Forward Neural Representation
- $p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}; \theta)$
- $f_1, \ldots, f_{d_{\mathrm{win}}}$ are words in window
- Input representation is the concatenation of embeddings
$$x = [v(f_1)\ v(f_2)\ \ldots\ v(f_{d_{\mathrm{win}}})]$$
Example: NNLM ($d_{\mathrm{win}} = 5$)
[w3 w4 w5 w6 w7] w8
$$x = [v(w_3)\ v(w_4)\ v(w_5)\ v(w_6)\ v(w_7)]$$
[Figure: $x$ drawn as five blocks, each of width $d_{\mathrm{in}}/5$]
A Neural Probabilistic Language Model (Bengio, 2003)
One hidden layer multi-layer perceptron architecture,
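$$\mathrm{NN}_{\mathrm{MLP1}}(x) = \tanh(x W^1 + b^1)\, W^2 + b^2$$
Neural network architecture on top of concat.
$$y = \mathrm{softmax}(\mathrm{NN}_{\mathrm{MLP1}}(x))$$
Best model uses $d_{\mathrm{in}} = 30 \times d_{\mathrm{win}}$, $d_{\mathrm{hid}} = 100$.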
A Neural Probabilistic Language Model
Optional, direct connection layers,
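$$\mathrm{NN}_{\mathrm{DMLP1}}(x) = [\tanh(x W^1 + b^1), x]\, W^2 + b^2$$
- $W^1 \in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{hid}}}$, $b^1 \in \mathbb{R}^{1 \times d_{\mathrm{hid}}}$; first affine transformation
- $W^2 \in \mathbb{R}^{(d_{\mathrm{hid}} + d_{\mathrm{in}}) \times d_{\mathrm{out}}}$, $b^2 \in \mathbb{R}^{1 \times d_{\mathrm{out}}}$; second affine transformation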
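The forward pass of the feed-forward NNLM, with and without the optional direct connections, might be sketched as follows. This is plain Python with toy sizes and random weights; every name (`nnlm_forward`, `affine`, `rand_matrix`) and every dimension is an assumption chosen for illustration, not the original implementation.

```python
import math
import random

def affine(x, W, b):
    """Row vector x (length d_in) times W (d_in x d_out), plus bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def nnlm_forward(context, E, W1, b1, W2, b2, direct=False):
    """Distribution over the next word given a window of word ids."""
    # x = [v(f_1) v(f_2) ... v(f_dwin)]: concatenated input embeddings.
    x = [value for word_id in context for value in E[word_id]]
    h = [math.tanh(v) for v in affine(x, W1, b1)]
    # Direct connections append x to the hidden layer before W2.
    features = h + x if direct else h
    return softmax(affine(features, W2, b2))

rng = random.Random(0)
vocab, d_emb, d_win, d_hid = 20, 4, 3, 5   # toy sizes, not the paper's
d_in = d_win * d_emb
E = rand_matrix(vocab, d_emb, rng)
W1, b1 = rand_matrix(d_in, d_hid, rng), [0.0] * d_hid
W2, b2 = rand_matrix(d_hid, vocab, rng), [0.0] * vocab
W2_direct = rand_matrix(d_hid + d_in, vocab, rng)

context = [3, 7, 11]
y = nnlm_forward(context, E, W1, b1, W2, b2)
y_direct = nnlm_forward(context, E, W1, b1, W2_direct, b2, direct=True)
print(len(y), sum(y), sum(y_direct))
```

The only change for direct connections is the shape of the second weight matrix, which grows from `d_hid` rows to `d_hid + d_in` rows.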
A Neural Probabilistic Language Model (Bengio, 2003)
Dashed lines show the optional direct connections, C = v.
Parameters
- Bengio NNLM has $d_{\mathrm{hid}} = 100$, $d_{\mathrm{win}} = 5$, $d_{\mathrm{in}} = 5 \times 50$
- In-Class: How many parameters does it have? How does this compare to Kneser-Ney smoothing?
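One way to approach the in-class question is to count parameters per component. The slide does not give the vocabulary size, so the value 10,000 below is purely an assumption, as is the function name `nnlm_parameter_count`.

```python
def nnlm_parameter_count(vocab_size, d_emb, d_win, d_hid, direct=False):
    """Parameter counts for a one-hidden-layer feed-forward NNLM."""
    d_in = d_win * d_emb
    counts = {
        "embeddings": vocab_size * d_emb,   # input lookup table v(.)
        "W1": d_in * d_hid,
        "b1": d_hid,
        # With direct connections W2 also sees the d_in input features.
        "W2": (d_hid + (d_in if direct else 0)) * vocab_size,
        "b2": vocab_size,
    }
    counts["total"] = sum(counts.values())
    return counts

# d_hid = 100, d_win = 5, d_in = 5 x 50 as on the slide;
# vocab_size = 10,000 is a hypothetical value.
counts = nnlm_parameter_count(vocab_size=10_000, d_emb=50, d_win=5, d_hid=100)
print(counts)
```

Under these assumptions the total is dominated by the two vocabulary-sized tables (input embeddings and `W2`), and it does not grow with the corpus. A count-based model such as Kneser-Ney instead stores statistics for each distinct n-gram observed, so its size grows with the training data.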
Historical Note
- Bengio et al. note that many of these aspects predate the work
- They furthermore propose many of the ideas that Collobert et al. and word2vec implement and scale
- Around this time, very few NLP papers were on NNs; the most-cited papers are about conditional random fields (CRFs).
Log-Bilinear Language Model (Mnih & Hinton, 2007)
Slightly different input representation. Now let:
$$x = \sum_{i=1}^{d_{\mathrm{win}}} v(f_i)\, C^i$$
- Instead of concatenating, weight each $v(f_i)$ by a position-specific weight matrix $C^i$.
Then use:
$$y = \mathrm{softmax}(x W^1 + b)$$
- Note no tanh layer.
- $W^1$ can use input embeddings too, or not (Mnih and Teh, 2012)
- Can be faster to use, and in some cases simpler.
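A minimal sketch of the log-bilinear forward pass, assuming square position matrices of size d_emb x d_emb and, as one of the two options mentioned, an output matrix tied to the input embeddings. All names and sizes are hypothetical.

```python
import math
import random

def vec_mat(x, M):
    """Row vector x times matrix M (M has len(x) rows)."""
    return [sum(x[i] * M[i][j] for i in range(len(x)))
            for j in range(len(M[0]))]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_bilinear_forward(context, E, C, W1, b):
    """y = softmax(x W1 + b) with x = sum_i v(f_i) C[i]; no tanh layer."""
    x = [0.0] * len(E[0])
    for position, word_id in enumerate(context):
        contribution = vec_mat(E[word_id], C[position])
        x = [a + c for a, c in zip(x, contribution)]
    scores = [s + bias for s, bias in zip(vec_mat(x, W1), b)]
    return softmax(scores)

rng = random.Random(0)
vocab, d_emb, d_win = 15, 4, 3   # toy sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(vocab, d_emb)
C = [rand_matrix(d_emb, d_emb) for _ in range(d_win)]
# Tied output weights: W1 is the transpose of the input embedding table.
W1_tied = [[E[w][k] for w in range(vocab)] for k in range(d_emb)]
b = [0.0] * vocab

y = log_bilinear_forward([2, 5, 9], E, C, W1_tied, b)
print(len(y), sum(y))
```

With tied weights, each score is a dot product between the predicted vector `x` and a word's embedding, which is where the bilinear form comes from.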
Comparison
Both count-based models and feed-forward NNLMs are Markovian language models.
Comparison:
- Training Speed: ngrams are much faster (more coming)
- Usage Speed: ngrams very fast, NN can be fast with some tricks.
- Memory: NN models can be much smaller (but there are big ones)
- Accuracy: Comparable for small data, NN does better with more.
Advantages of NN model
- Can be trained end-to-end.
- Does not require smoothing methods.
Translation Performance ( and Blunsom, 2015)
Contents
Neural Language Models
Noise Contrastive Estimation
Review: Softmax Issues
Use a softmax to force a distribution,
$$\mathrm{softmax}(z) = \frac{\exp(z)}{\sum_{w \in \mathcal{C}} \exp(z_w)}$$
$$\log \mathrm{softmax}(z) = z - \log \sum_{w \in \mathcal{C}} \exp(z_w)$$
- Issue: class $\mathcal{C}$ is huge.
- For C&W, 100,000, for word2vec 1,000,000 types
- Note largest dataset is 6 billion words
Unnormalized Scores
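Recall the score defined as (dropping bias)
$$z = \tanh(x W^1)\, W^2$$
Unnormalized score of each word before softmax,
$$z_j = \tanh(x W^1)\, W^2_{*,j}$$
for any $j \in \{1, \ldots, d_{\mathrm{out}}\}$
Note: a single score $z_j$ needs only column $j$ of $W^2$, so its cost is $O(1)$ in the vocabulary size, versus $O(d_{\mathrm{out}})$ for all scores.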
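To make the cost argument concrete, the sketch below computes one score from a single column of the output matrix and checks it against the corresponding entry of the full score vector. Sizes and names are hypothetical.

```python
import math
import random

rng = random.Random(0)
d_in, d_hid, d_out = 6, 4, 50   # toy sizes; d_out plays the vocabulary size

x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_hid)] for _ in range(d_in)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(d_out)] for _ in range(d_hid)]

def hidden(x, W1):
    """tanh(x W1), shared by every output word."""
    return [math.tanh(sum(x[i] * W1[i][k] for i in range(len(x))))
            for k in range(len(W1[0]))]

def full_scores(h, W2):
    """All d_out unnormalized scores: d_hid * d_out multiplications."""
    return [sum(h[k] * W2[k][j] for k in range(len(h)))
            for j in range(len(W2[0]))]

def single_score(h, W2, j):
    """One score z_j: d_hid multiplications, independent of d_out."""
    return sum(h[k] * W2[k][j] for k in range(len(h)))

h = hidden(x, W1)
z = full_scores(h, W2)
z_17 = single_score(h, W2, 17)
print(z_17, z[17])
```

The hidden layer is computed once either way; only the last step changes from a matrix-vector product to a single dot product.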
Coherence
- Saw similar idea earlier for ranking embedding.
- Idea: Learn to distinguish coherent n-grams from corruption.
- Want to discriminate correct next words from other choices.
[ the dog walks ]
[ the dog house ]
[ the dog cats ]
[ the dog skips ]
Warm-Up
Imagine we have a new dataset,
$$((x_1, y_1), d_1), \ldots, ((x_n, y_n), d_n)$$
- $x$: representation of context $w_{i-n+1}, \ldots, w_{i-1}$
- $y$: a possible $w_i$
- $d$: 1 if $y$ is correct, 0 otherwise
Objective is based on predicted $\hat{d}$:
$$L(\theta) = \sum_i L_{\mathrm{crossentropy}}(d_i, \hat{d}_i)$$
Warm-Up: Binary Classification
How do we score $(x_i, y_i = \delta(w))$?
Could use unnormalized score,
$$z_w = \tanh(x W^1)\, W^2_{*,w}$$
Becomes softmax regression/non-linear logistic regression,
$$\hat{d} = \sigma(z_w)$$
- Much faster
- But does not help us train LM.
Implementation
Standard MLP language model (only takes in x),
x ⇒ $W^1$ ⇒ tanh ⇒ $W^2$ ⇒ softmax
Computing binary (takes in x and y),
$$\hat{d} = \sigma(z_w)$$
x ⇒ $W^1$ ⇒ tanh ⇒ dot product with $W^2_{*,w}$ (Lookup) ⇒ σ
Noise Contrastive Estimation 1
Probabilistic model,
- Introduce random variable $D$
  - If $D = 1$ produce true sample
  - If $D = 0$ produce sample from a noise distribution.
- Hyperparameter $K$ is ratio of noise
$$p(D = 1) = \frac{1}{K + 1}, \qquad p(D = 0) = \frac{K}{K + 1}$$
Noise Contrastive Estimation 2
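For a given $x, y$,
$$p(D = 1 \mid x, y) = \frac{p(y \mid D = 1, x)\, p(D = 1 \mid x)}{\sum_d p(y \mid D = d, x)\, p(D = d \mid x)}$$
$$= \frac{p(y \mid D = 1, x)\, p(D = 1 \mid x)}{p(y \mid D = 0, x)\, p(D = 0 \mid x) + p(y \mid D = 1, x)\, p(D = 1 \mid x)}$$
Plug in the noise distribution and hyperparameters,
$$p(D = 1 \mid x, y) = \frac{\frac{1}{K+1}\, p(y \mid D = 1, x)}{\frac{1}{K+1}\, p(y \mid D = 1, x) + \frac{K}{K+1}\, p(y \mid D = 0, x)}$$
$$= \frac{p(y \mid D = 1, x)}{p(y \mid D = 1, x) + K\, p(y \mid D = 0, x)}$$
$$= \sigma(\log p(y \mid D = 1, x) - \log(K\, p(y \mid D = 0, x)))$$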
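The final step uses the identity a/(a + b) = σ(log a − log b) for positive a and b. Written out, with the shorthand a = p(y | D = 1, x) and b = K p(y | D = 0, x) introduced here only for brevity:

```latex
\frac{a}{a + b}
  = \frac{1}{1 + b/a}
  = \frac{1}{1 + \exp\bigl(-(\log a - \log b)\bigr)}
  = \sigma(\log a - \log b),
\qquad \text{where } \sigma(t) = \frac{1}{1 + e^{-t}}.
```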
Noise Contrastive Estimation 3
With
$$p(D = 1 \mid x, y) = \sigma(\log p(y \mid D = 1, x) - \log(K\, p(y \mid D = 0, x)))$$
the training objective for a corpus that has $K$ noise samples $s_{i,k}$ per example is:
$$L(\theta) = \sum_i \left[ \log p(D = 1 \mid x_i, y_i) + \sum_{k=1}^{K} \log p(D = 0 \mid x_i, Y = s_{i,k}) \right]$$
$$= \sum_i \Big[ \log \sigma\big(\log p(y_i \mid D = 1, x_i) - \log(K\, p(y_i \mid D = 0, x_i))\big)$$
$$\qquad + \sum_{k=1}^{K} \log\Big(1 - \sigma\big(\log p(s_{i,k} \mid D = 1, x_i) - \log(K\, p(s_{i,k} \mid D = 0, x_i))\big)\Big) \Big]$$
- In practice, sample $s_{i,k}$ from unigram distribution
Noise Contrastive Estimation 4
But we still have a problem: $L$ is defined in terms of normalized distributions $\log p(y \mid D = 1, x)$
Solution:
- Instead of explicitly normalizing, estimate $Z(x)$, the normalizing constant of each context $x$, as a parameter (Gutmann & Hyvärinen, 2010)
- Mnih and Teh (2012) show that fixing $Z(x) = 1$ for all contexts works just as well
- So we can replace $\log p(y = \delta(w) \mid D = 1, x)$ with $z_w$, as computed by our network
Noise Contrastive Estimation 5
So we now have
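$$L(\theta) = \sum_i \left[ \log \sigma\big(z_{w_i} - \log(K\, p_{\mathrm{ML}}(w_i))\big) + \sum_{k=1}^{K} \log\Big(1 - \sigma\big(z_{s_{i,k}} - \log(K\, p_{\mathrm{ML}}(s_{i,k}))\big)\Big) \right]$$
- Mnih and Teh (2012) show that the gradient of $L$ approaches the gradient of the true language model's log-likelihood objective as $K \to \infty$.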
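The per-example term of this objective can be sketched in a few lines. The function names and the example numbers are made up; the scores would come from the network and the noise probabilities from the unigram distribution.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigma(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def nce_objective(z_true, p_noise_true, z_samples, p_noise_samples, K):
    """NCE objective for one example (to be maximized), with Z(x) fixed to 1.

    z_true:          network score z_w of the observed next word
    p_noise_true:    noise (unigram) probability of the observed word
    z_samples:       network scores of the K noise samples
    p_noise_samples: noise probabilities of the K noise samples
    """
    total = log_sigmoid(z_true - math.log(K * p_noise_true))
    for z_s, p_s in zip(z_samples, p_noise_samples):
        # log(1 - sigma(t)) equals log(sigma(-t)).
        total += log_sigmoid(-(z_s - math.log(K * p_s)))
    return total

# Hypothetical scores and unigram probabilities with K = 2 noise samples.
objective = nce_objective(
    z_true=-1.0, p_noise_true=0.05,
    z_samples=[-4.0, -3.0], p_noise_samples=[0.1, 0.02],
    K=2,
)
print(objective)
```

Raising the score of the observed word or lowering the scores of the noise samples both increase the objective, which is the binary classification behavior the derivation calls for.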
Implementation
- How do you efficiently compute $z_w$?
  Need a lookup table (and dot-product) for output embeddings! (Not full matrix-vector product).
- How do you efficiently handle $\log p_{\mathrm{ML}}(w)$?
  Can be precomputed or placed in a lookup table.
- How do you handle sampling?
  Can precompute large number of samples (not example specific).
- How do you handle loss?
  Simply BinaryNLL Objective.
Implementation
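Standard MLP language model,
x ⇒ $W^1$ ⇒ tanh ⇒ $W^2$ ⇒ softmax
Computing $\sigma(z_w - \log(K\, p_{\mathrm{ML}}(w)))$,
x ⇒ $W^1$ ⇒ tanh ⇒ dot product with $W^2_{*,w}$ (Lookup) ⇒ subtract $\log(K\, p_{\mathrm{ML}}(w))$ (input) ⇒ σ
(Efficiency: compute first three layers only once for all K + 1 samples)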
Using in Practice
Several options for test time,
- Use full softmax with learned parameters.
- Compute subset of scores and renormalize (homework).
- Can sometimes just treat unnormalized scores as being normalized (self-normalization)
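The last two options might look like the following sketch. The candidate set, the scores, and the function names are hypothetical, and the self-normalized variant is only meaningful if training has pushed the normalizer Z(x) close to 1.

```python
import math

def renormalize_subset(scores, candidates):
    """Softmax restricted to a candidate subset; scores maps word -> z_w."""
    m = max(scores[w] for w in candidates)
    exps = {w: math.exp(scores[w] - m) for w in candidates}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def self_normalized(scores, word):
    """Treat exp(z_w) directly as a probability (assumes Z(x) is near 1)."""
    return math.exp(scores[word])

# Hypothetical unnormalized scores for a handful of words.
scores = {"walks": -0.9, "house": -1.6, "cats": -4.2, "skips": -2.5}
subset_probs = renormalize_subset(scores, ["walks", "house", "skips"])
print(subset_probs)
print(self_normalized(scores, "walks"))
```

Renormalizing over a subset always yields a proper distribution over that subset, whereas the self-normalized values need not sum to 1.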