
Deep Learning for Natural Language Processing

Dylan Drover, Borui Ye, Jie Peng

University of Waterloo

djdrover@uwaterloo.ca, [email protected]

July 8, 2015


Overview

1 Neural Networks: An “Intuitive” Look
  - The Basics
  - Deep Learning
  - RBM and Language Understanding
  - RNN and Statistical Machine Translation

2 Neural Probabilistic Language Model
  - Why Use Distributed Representation
  - Bengio’s Neural Network

3 Google’s Word2Vec
  - CBOW+Hierarchical Softmax
  - CBOW+Negative Sampling


Introduction to Neural Network Models

A Black Box with a Billion Dials

But we can do better...


The Artificial Neuron

The base unit of (most) neural networks is a simplified model of a biological neuron

A neuron has a set of inputs with associated weights, and an activation function that determines whether the neuron will “fire” (be activated)

Together, these units form very flexible function modellers/approximators


The Artificial Neuron

The output y is some function of the weighted sum of the inputs:

y = f( ∑_{j=0}^{m} w_kj x_j )

Each neuron has a weight vector, and each layer has a weight matrix W
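
To make the computation concrete, here is a minimal sketch (NumPy, with made-up numbers) of a single artificial neuron; the sigmoid is just one possible choice of activation f:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and inputs; x[0] = 1 acts as the bias input
w = np.array([0.5, -1.2, 0.8])   # weight vector w_kj for one neuron
x = np.array([1.0, 0.3, 0.7])    # inputs x_j

y = sigmoid(np.dot(w, x))        # y = f( sum_j w_kj * x_j )
print(y)
```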


Multilayer Perceptron

Performs multiple logistic regressions at once, giving arbitrary function approximation

The multilayer perceptron is a network you might consider “deep”, but it isn’t: it suffers from the “vanishing gradient” problem


Backpropagation: Learning with Derivatives

The basic idea behind learning in a neural network is that the network has:

Objective function: J(θ), E(v, h), or a training vector
  - This applies to both supervised and unsupervised learning
  - The network’s output is compared with the objective to obtain an error

Optimization algorithm: Stochastic Gradient Descent, Contrastive Divergence, etc.
  - These algorithms direct learning within the “weight space”

But how do we adjust the weights to optimize the objective?


Chain Rule

Use the chain rule to determine how each intermediate output (and hence each weight) affects the final desired output

We now have a gradient for every weight that we can use for optimization


Backpropagation

Calculate the network error (where t is the target and y is the network output):

E = (1/2)(t − y)^2

∂E/∂w_ij = (∂E/∂o_j) · (∂o_j/∂net_j) · (∂net_j/∂w_ij)

∂net_j/∂w_ij = ∂/∂w_ij ( ∑_{k=1}^{n} w_kj x_k ) = x_i

∆w_ij = −α ∂E/∂w_ij
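
As an illustration (not code from the slides), a minimal NumPy sketch of one backpropagation update for a single sigmoid neuron with squared error; the toy inputs, weights, target, and learning rate α are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.3, 0.7])   # inputs (x[0] = 1 is the bias input)
w = np.array([0.5, -1.2, 0.8])  # current weights
t = 1.0                          # target output
alpha = 0.1                      # learning rate

net = np.dot(w, x)               # net_j = sum_k w_kj x_k
y = sigmoid(net)                 # o_j = f(net_j)

# Chain rule: dE/dw_ij = dE/do_j * do_j/dnet_j * dnet_j/dw_ij
dE_dy = -(t - y)                 # from E = 1/2 (t - y)^2
dy_dnet = y * (1.0 - y)          # derivative of the sigmoid
dnet_dw = x                      # dnet_j/dw_ij = x_i
grad = dE_dy * dy_dnet * dnet_dw

w = w - alpha * grad             # Δw_ij = -α ∂E/∂w_ij
print(w)
```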


High Dimensional Optimization Problem: Weight Space

Training a neural network is an N-dimensional optimization problem

Each weight in the NN is a dimension, and the goal is to find the minimum error (“height”) with gradient descent

There are many optimization algorithms for finding the weights
  - Hessian-Free Optimization, Stochastic Gradient Descent, RMSProp, AdaGrad, Momentum, etc.


What is ”Deep Learning” and why should we care?

Deep Learning is just a re-branding of artificial neural networks

Multiple factors led to the new, deeper NN architectures
  - Pre-training (unsupervised)
  - Faster optimization algorithms
  - Graphics Processing Units (GPUs)

A myriad of techniques existed previously, but things began to come together in the mid-2000s


Better Results Through Pre-Training

Train each layer of the network greedily with unsupervised learning, using the previous layer’s output as input

Combine the layers and fine-tune with supervised training, as in an MLP

The network starts from a better position in weight space

MNIST error rates [Erhan et al., JMLR 2010]


Promise for NLP

Improved results for Part of Speech tagging and Named Entity Recognition


Application of Deep Belief Networks for Natural Language Understanding [Sarikaya et al.]

Sarikaya et al. attempt to solve the problem of spoken language understanding (SLU) in the context of call routing (call-centre data)

Data:
  - 27,000 transcribed utterances serve as unlabelled data
  - Sets of 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, and 10K utterances are used as labelled data sets

Restricted Boltzmann Machines (RBMs) were trained with the unlabelled data and then stacked to form a Deep Belief Network


Restricted Boltzmann Machine

Composed of visible and hidden neurons

Unlabelled training data is presented as v

The hidden neurons (stochastic binary units) and the weights attempt to approximate a joint probability distribution of the data (a generative model)


Learning Without Knowing

The joint energy of the RBM is defined as:

E(v, h) = −∑_i a_i v_i − ∑_j b_j h_j − ∑_i ∑_j v_i w_ij h_j

which defines the probability distribution:

p(v, h) = (1/Z) e^{−E(v, h)}

which is marginalized over the hidden vectors:

p(v) = (1/Z) ∑_h e^{−E(v, h)}

Z acts as a normalizing factor:

Z = ∑_{v, h} e^{−E(v, h)}


Learning without Knowing

Training uses a “reconstruction” of the input vector (the math is easier), obtained from the hidden units with:

p(h_j = 1 | v) = σ( b_j + ∑_i v_i w_ij )

which is used in:

p(v_i = 1 | h) = σ( a_i + ∑_j h_j w_ij )


Learning without Knowing

The rough reconstruction is compared to the actual input.

Based on the reconstruction, the weights are changed to improve the log-likelihood:

∂ log p(v) / ∂w_ij = 〈v_i h_j〉_data − 〈v_i h_j〉_model

∆w_ij ∝ 〈v_i h_j〉_data − 〈v_i h_j〉_model

∆w_ij ∝ 〈v_i h_j〉_data − 〈v_i h_j〉_recon

This is a rough approximation; however, it works well in practice.

(Angle brackets denote expectations with respect to the distribution named in the subscript.)
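
For concreteness, a minimal NumPy sketch of one CD-1 (contrastive divergence) weight update for a small binary RBM; the sizes, learning rate, and data vector are made-up placeholders, not the configuration used by Sarikaya et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1

W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # weights w_ij
a = np.zeros(n_visible)                                 # visible biases
b = np.zeros(n_hidden)                                  # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v0 = np.array([1, 0, 1, 1, 0, 0], dtype=float)          # one training vector

# Positive phase: p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
ph0 = sigmoid(b + v0 @ W)
h0 = (rng.random(n_hidden) < ph0).astype(float)          # sample hidden states

# Reconstruction: p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j w_ij)
pv1 = sigmoid(a + h0 @ W.T)
ph1 = sigmoid(b + pv1 @ W)

# CD-1 update: Δw_ij ∝ <v_i h_j>_data − <v_i h_j>_recon
W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
a += lr * (v0 - pv1)
b += lr * (ph0 - ph1)
```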


Visualization of Weights from an RBM

Each square is a set of weights for one neuron

Features emerge from unsupervised learning


Deep Belief Network

Use one RBM’s output to train the next-layer RBM
  - Each layer learns features
  - Each successive layer learns features of features

This continues up to a final supervised layer

Results in a Deep Belief Network


Results

Results show improvements from unsupervised learning in all aspects


Autoencoders

Creates a compressed representation of the data (dimensionality reduction)

The central layer acts as a non-linear principal component analysis

Decoder weights are the transposed encoder weight matrices: W_l and W_l^T

Each layer is trained greedily (similar to DBN layers)
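
A minimal sketch (NumPy, made-up sizes) of the tied-weight idea: the decoder reuses the transposed encoder matrix, and training would minimize the reconstruction error, e.g. with the backpropagation updates shown earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_code = 8, 3                              # made-up layer sizes
W = 0.1 * rng.standard_normal((n_code, n_in))    # encoder weights W_l
b_enc = np.zeros(n_code)
b_dec = np.zeros(n_in)

x = rng.random(n_in)                             # an input vector

code = sigmoid(W @ x + b_enc)                    # compressed (bottleneck) representation
x_hat = sigmoid(W.T @ code + b_dec)              # decoder uses the transpose W_l^T (tied weights)

reconstruction_error = np.mean((x - x_hat) ** 2)
print(code, reconstruction_error)
```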


Recurrent Neural Networks

Input is treated as a time series: …, x_{t−1}, x_t, x_{t+1}, …

Retains temporal context, or short-term memory

Trained with backpropagation through time (BPTT)


Long Short-Term Memory

Prevents the “vanishing gradient” problem

Can retain information for arbitrary amounts of time


Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation [Cho et al.]

Combination of an RNN with an autoencoder

The encoder maps a variable-length source phrase X = (x_1, x_2, …, x_N) (English sentence) into a fixed-length internal representation vector c

The decoder then maps this back into another variable-length sequence, the target sentence Y = (y_1, y_2, …, y_{N′}) (French sentence)

Analysis showed that the internal representation (in the 1000 hidden units) preserved syntactic and semantic information

Learns the probability distribution:

p(y_1, …, y_{T′} | x_1, …, x_T)


Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

The decoder uses the compressed semantic representation vector c as well as the previous word of its translation:

h⟨t⟩ = f( h⟨t−1⟩, y_{t−1}, c )

The next element in the translated sequence is conditioned on:

P(y_t | y_{t−1}, …, y_1, c) = g( h⟨t⟩, y_{t−1}, c )

Training of the network attempts to maximize the conditional log-likelihood:

max_θ (1/N) ∑_{n=1}^{N} log p_θ( y_n | x_n )

where θ are the model parameters (weights) and (x_n, y_n) are input and output pairs.
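
As a rough illustration only (this is not Cho et al.'s GRU-based architecture, and the tiny dimensions and token ids are made up), a sketch of an encoder folding a source sentence into c and a decoder conditioning each step on its previous state, the previous word, and c:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab = 8, 16, 100          # made-up sizes, far smaller than the paper's

E = 0.1 * rng.standard_normal((vocab, d_emb))              # word embeddings
W_enc = 0.1 * rng.standard_normal((d_hid, d_emb + d_hid))
W_dec = 0.1 * rng.standard_normal((d_hid, d_emb + 2 * d_hid))
W_out = 0.1 * rng.standard_normal((vocab, d_hid))

def rnn_step(W, parts):
    return np.tanh(W @ np.concatenate(parts))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: fold the source sentence into a fixed-length vector c
src = [3, 17, 42]                          # token ids of the source phrase
h = np.zeros(d_hid)
for x_t in src:
    h = rnn_step(W_enc, [E[x_t], h])
c = h                                       # fixed-length summary of the source

# Decoder: h<t> = f(h<t-1>, y_{t-1}, c); next word sampled from a softmax over the vocabulary
y_prev, s = 0, np.zeros(d_hid)             # 0 = hypothetical start-of-sentence id
for _ in range(4):
    s = rnn_step(W_dec, [E[y_prev], s, c])
    p = softmax(W_out @ s)                 # distribution over the target vocabulary
    y_prev = int(np.argmax(p))             # greedy choice of the next target word
    print(y_prev)
```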


Encoder - Decoder


Big Data

The training of the model used:
  - Bilingual corpora: Europarl (61M words)
  - News commentary (5.5M words)
  - UN transcriptions (421M words)
  - 870M words from crawled corpora

However, to optimize results, only a subset of 348M words was used for training the translation model

The final BLEU score for the model was 34.54


Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

This model is not restricted to just translation (which is why ANNs are so useful and exciting)

The model also created semantic relations between similar words and sentences from their continuous vector-space representations (more on that later)


Results

(Result figures from Cho et al., not reproduced in this transcript.)

Neural Probabilistic Language Model

Presenter: Borui Ye

Papers:

1 Efficient Estimation of Word Representations in Vector Space

2 A Neural Probabilistic Language Model


What is a Word Vector?

One-hot representation: represents a word using a long vector. For example:
“microphone” is represented as: [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 …]
“phone” is represented as: [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 …]

PROS: this representation works well with maximum entropy, SVM, and CRF algorithms.

CONS: the “word gap”: any two one-hot vectors are orthogonal, so no similarity between words is captured.
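
A tiny illustration of the word gap (made-up vocabulary): any two distinct one-hot vectors have dot product and cosine similarity zero, so “microphone” looks no closer to “phone” than to any other word:

```python
import numpy as np

vocab = ["the", "a", "cat", "microphone", "dog", "walk", "run", "room", "phone"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

mic, phone, cat = one_hot("microphone"), one_hot("phone"), one_hot("cat")

# Dot product between any two distinct one-hot vectors is 0
print(mic @ phone, mic @ cat)   # 0.0 0.0 -- "microphone" looks no closer to "phone" than to "cat"
```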


What is a Word Vector? (Cont.)


How To Train Word Vectors

Language model: in practice, we usually need to calculate the probability of a sentence:

P(S) = p(w_1, w_2, w_3, w_4, w_5, …, w_n)
     = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) … p(w_n | w_1, w_2, …, w_{n−1})

Markov assumption: each word depends only on the previous n − 1 words.

P(S) = p(w_1, w_2, w_3, w_4, w_5, …, w_n)
     = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) … p(w_n | w_1, w_2, …, w_{n−1})
     ≈ p(w_1) p(w_2 | w_1) p(w_3 | w_2) … p(w_n | w_{n−1})   (bigram)
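
As a concrete toy example of the bigram approximation, counting-based maximum-likelihood estimates over the two example sentences from the next slide; the start symbol <s> and the absence of smoothing are simplifications for illustration:

```python
from collections import Counter

corpus = [
    "the cat is walking in the bedroom",
    "a dog was running in a room",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    # Maximum-likelihood estimate p(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

# P(S) ≈ p(w1 | <s>) * p(w2 | w1) * ... * p(wn | w_{n-1})
sentence = "the cat is walking in the bedroom".split()
p = 1.0
for prev, w in zip(["<s>"] + sentence, sentence):
    p *= p_bigram(w, prev)
print(p)
```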


How To Train Word Vectors (Cont.)

Problems With N-gram Model:

It does not take into account contexts farther back than one or two words

It cannot capture the similarities among words.

Example:
The cat is walking in the bedroom
A dog was running in a room


Bengio’s Neural Network

Training set: a sequence w_1 … w_T of words w_t ∈ V, where the vocabulary V is a large but finite set.

Objective: learn a model f(w_t, …, w_{t−n+1}) = P̂(w_t | w_1^{t−1})

Constraint: ∑_{i=1}^{|V|} f(i, w_{t−1}, …, w_{t−n+1}) = 1, with f > 0


Bengio’s Neural Network (Cont.)


Parameters

Definitions:

C : a shared word-vector matrix, C ∈ R^{|V| × m}

x : concatenation of the context word vectors (the input to the hidden layer), x = (C(w_{t−1}), C(w_{t−2}), …, C(w_{t−n+1}))

y : vector of the output layer, y = b + W x + U tanh(d + H x)

P(w_t = i | context) = P̂(w_t | w_{t−1}, …, w_{t−n+1}) = e^{y_{w_t}} / ∑_i e^{y_i}
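
A minimal forward-pass sketch of Bengio's model under these definitions (made-up sizes and word ids; no training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, h, n = 1000, 30, 50, 4          # made-up vocabulary size, embedding dim, hidden dim, n-gram order

C = 0.1 * rng.standard_normal((V, m))                 # shared word-vector matrix
H = 0.1 * rng.standard_normal((h, (n - 1) * m))       # hidden-layer weights
d = np.zeros(h)
U = 0.1 * rng.standard_normal((V, h))                 # hidden-to-output weights
W = 0.1 * rng.standard_normal((V, (n - 1) * m))       # direct input-to-output weights
b = np.zeros(V)

context = [12, 7, 981]                                # ids of w_{t-1}, w_{t-2}, w_{t-3}

x = np.concatenate([C[w] for w in context])           # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
y = b + W @ x + U @ np.tanh(d + H @ x)                # y = b + Wx + U tanh(d + Hx)

p = np.exp(y - y.max())
p /= p.sum()                                          # softmax: P(w_t = i | context) = e^{y_i} / sum_i e^{y_i}
print(p.argmax(), p.max())
```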


Parameter Estimation

The goal is to find the parameters θ that maximize the penalized log-likelihood of the training corpus:

L = (1/T) ∑_t log f(w_t, w_{t−1}, …, w_{t−n+1}; θ) + R(θ)

where θ = (b, d, W, U, H, C)

SGD: θ ← θ + ε ∂ log P̂(w_t | w_{t−1}, …, w_{t−n+1}) / ∂θ


Google’s Word2Vec

Project url : http://code.google.com/p/word2vec/

Feature: additive compositionality:
vector(‘Paris’) − vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
vector(‘king’) − vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
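
To see the vector-offset arithmetic in action, a toy sketch with hand-made 2-d vectors (real word2vec embeddings are learned, dense, and typically hundreds of dimensions); only the offset-and-nearest-neighbour idea carries over:

```python
import numpy as np

# Toy, hand-made 2-d "embeddings" (dimensions: royalty, gender), purely to
# illustrate the vector-offset idea.
vectors = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.5, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    # vector(b) - vector(a) + vector(c): e.g. king - man + woman ≈ queen
    target = vectors[b] - vectors[a] + vectors[c]
    return max((w for w in vectors if w not in (a, b, c)),
               key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "king", "woman"))   # -> "queen"
```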


Google’s Word2Vec (Cont.)

Example: the bundled ./distance tool, which lists the words closest to a query word by cosine similarity (output not reproduced here).


Two Models and Two Algorithms

Model                            Training algorithms
Continuous Bag of Words (CBOW)   Hierarchical Softmax, Negative Sampling
Skip-gram                        Hierarchical Softmax, Negative Sampling


CBOW+Hierarchical Softmax

Predict the probability of a word given its context:

p(w | Context(w))

Learning objective: maximize the log-likelihood:

ζ = ∑_{w ∈ C} log p(w | Context(w))


CBOW+Hierarchical Softmax (Cont.)

Input layer: the 2c word vectors of Context(w): v(Context(w)_1), v(Context(w)_2), …, v(Context(w)_{2c}) ∈ R^m

Projection layer: the sum of all vectors in the input layer:

x_w = ∑_{i=1}^{2c} v(Context(w)_i) ∈ R^m

Output layer: a Huffman tree with the vocabulary words as leaves.


CBOW+Hierarchical Softmax (Cont.)

Notations

p^w : the path from the root to the leaf corresponding to w

l^w : the number of nodes on path p^w

p^w_1, p^w_2, …, p^w_{l^w} : the nodes of path p^w

d^w_2, d^w_3, …, d^w_{l^w} ∈ {0, 1} : the Huffman code of each node on path p^w (the root has no code)

θ^w_1, θ^w_2, …, θ^w_{l^w−1} : the vector attached to each non-leaf node on path p^w


Huffman Tree

(Example Huffman tree built over the vocabulary {I, love, watching, Brazil, football, game}; figure not reproduced.)

Learning Objective

We assign every node a label:

Label(p^w_i) = 1 − d^w_i,   i = 2, 3, …, l^w

So the probability of a node being classified as a positive label is:

σ(x_w^T θ) = 1 / (1 + e^{−x_w^T θ})

Then:

p(w | Context(w)) = ∏_{j=2}^{l^w} p(d^w_j | x_w, θ^w_{j−1})

where

p(d^w_j | x_w, θ^w_{j−1}) = σ(x_w^T θ^w_{j−1})       if d^w_j = 0;
                            1 − σ(x_w^T θ^w_{j−1})   if d^w_j = 1.


Learning Objective

Full learning objective is to maximize:

ζ = ∑_{w ∈ C} log ∏_{j=2}^{l^w} { [σ(x_w^T θ^w_{j−1})]^{1 − d^w_j} · [1 − σ(x_w^T θ^w_{j−1})]^{d^w_j} }

  = ∑_{w ∈ C} ∑_{j=2}^{l^w} { (1 − d^w_j) log[σ(x_w^T θ^w_{j−1})] + d^w_j log[1 − σ(x_w^T θ^w_{j−1})] }
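
A toy sketch of how p(w | Context(w)) is evaluated along the Huffman path (hand-picked codes and random parameters; real word2vec builds the tree from corpus word frequencies):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                  # made-up embedding dimension

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# x_w: projection-layer vector (sum of the context word vectors)
context_vectors = 0.1 * rng.standard_normal((4, m))      # 2c = 4 context words
x_w = context_vectors.sum(axis=0)

# Huffman path for the target word w: codes d^w_j and one parameter
# vector theta^w_{j-1} per internal node on the path (toy values).
codes = [1, 0, 1]                                        # d^w_2 .. d^w_{l^w}
thetas = [0.1 * rng.standard_normal(m) for _ in codes]   # theta^w_1 .. theta^w_{l^w - 1}

# p(w | Context(w)) = prod_j p(d^w_j | x_w, theta^w_{j-1})
p = 1.0
for d, theta in zip(codes, thetas):
    s = sigmoid(x_w @ theta)
    p *= s if d == 0 else (1.0 - s)
print(p)
```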


CBOW+Negative Sampling

In a nutshell, there is no Huffman tree in the output layer, but a set of negative samples instead (given Context(w), the word w is a positive sample, while other words are negative). Negative samples are randomly selected.
Assume we have a negative sample set NEG(w) ≠ ∅. For every w̃ ∈ D, we denote the label of w̃ as follows:

L^w(w̃) = 1   if w̃ = w;
          0   if w̃ ≠ w.


CBOW+Negative Sampling (Cont.)

Given (Context(w), w), our goal is to maximize:

g(w) = ∏_{u ∈ {w} ∪ NEG(w)} p(u | Context(w))

where

p(u | Context(w)) = σ(x_w^T θ^u)       if L^w(u) = 1;
                    1 − σ(x_w^T θ^u)   if L^w(u) = 0.


CBOW+Negative Sampling (Cont.)

To increase the probability of the positive sample and decrease those of the negative ones:

g(w) = ∏_{u ∈ {w} ∪ NEG(w)} p(u | Context(w))

     = σ(x_w^T θ^w) ∏_{u ∈ NEG(w)} [1 − σ(x_w^T θ^u)]

Then:

G = ∏_{w ∈ C} g(w)

where C is the corpus.


Learning Objective

Full learning objective: maximize the following:

ζ = log G = log ∏_{w ∈ C} g(w) = ∑_{w ∈ C} log g(w)

  = ∑_{w ∈ C} log ∏_{u ∈ {w} ∪ NEG(w)} { [σ(x_w^T θ^u)]^{L^w(u)} · [1 − σ(x_w^T θ^u)]^{1 − L^w(u)} }
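
A toy sketch of the per-word term log g(w) (random placeholder parameters and sample ids; real word2vec draws NEG(w) from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
m, vocab = 5, 50                        # made-up embedding dim and vocabulary size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = 0.1 * rng.standard_normal((vocab, m))  # one output vector theta^u per word
x_w = 0.1 * rng.standard_normal(m)             # projection of Context(w)

w = 7                                          # id of the positive (true) word
neg = [3, 19, 42, 11, 30]                      # ids of randomly drawn negative samples

# log g(w) = log sigma(x_w^T theta^w) + sum_u log(1 - sigma(x_w^T theta^u))
log_g = np.log(sigmoid(x_w @ theta[w]))
for u in neg:
    log_g += np.log(1.0 - sigmoid(x_w @ theta[u]))
print(log_g)       # gradient ascent on this quantity trains x_w and the theta vectors
```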


Difference Between CBOW and Skip-gram

Skip-gram is generally more accurate.

Skip-gram is slower, especially with larger context windows.


Why Use Negative Sampling & Hierarchical Softmax

Both replace the full softmax over the whole vocabulary in the output layer with a much cheaper computation.

(Figure: the example Huffman tree over {I, love, watching, Brazil, football, game} from the earlier slide.)

References

Sarikaya, Ruhi, Geoffrey E. Hinton, and Anoop Deoras. “Application of deep belief networks for natural language understanding.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.4 (2014): 778-784.

Cho, Kyunghyun, et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” arXiv preprint arXiv:1406.1078 (2014).

http://nlp.stanford.edu/courses/NAACL2013/NAACL2013-Socher-Manning-DeepLearning.pdf

http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

http://blog.csdn.net/itplus/article/details/37969979

http://licstar.net/archives/328


Thank you!
