Source: web.cs.ucdavis.edu/~yjlee/teaching/ecs289g-fall2015/vicky2.pdf
Recurrent Neural Network Based Language Model
Authors: Tomáš Mikolov et al., Brno University of Technology & Johns Hopkins University, USA
Presented by : Vicky Xuening Wang
ECS 289G, Nov 2015, UC Davis
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan “Honza” Černocký, Sanjeev Khudanpur
Language Model Tasks
• Statistical/Probabilistic Language Models
• Goal: compute the probability of a sentence or sequence of words:
• P(W) = P(w1, w2, w3, …, wn)
• Related task: predict the probability of an upcoming word:
• P(wn | w1, w2, …, wn-1)
Introduction - Language model
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
• Chain rule of probability
• Markov assumption
• N-gram model
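The three bullets above fit together in two lines: the chain rule factorizes the sentence probability exactly, and the Markov assumption truncates each history to the last N-1 words, which gives the N-gram model:

```latex
P(w_1,\dots,w_n) = \prod_{i=1}^{n} P(w_i \mid w_1,\dots,w_{i-1})
\quad \text{(chain rule)}

P(w_i \mid w_1,\dots,w_{i-1}) \approx P(w_i \mid w_{i-N+1},\dots,w_{i-1})
\quad \text{(Markov assumption, order } N\text{)}
```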
Introduction
• Typical tasks:
• Machine Translation: P(high winds tonite) > P(large winds tonite)
• Spell Correction: P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition: P(I saw a van) >> P(eyes awe of an)
• Summarization, question answering, etc.
Introduction - LM tasks
Introduction - Bigram model
Maximum Likelihood Estimation: P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
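A minimal sketch of this bigram MLE from raw counts; the toy sentences and the <s>/</s> boundary markers are illustrative choices, not from the slides:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) as count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words[:-1])                # histories only
        bigrams.update(zip(words[:-1], words[1:]))
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = train_bigram_mle([["I", "saw", "a", "van"], ["I", "saw", "a", "cat"]])
print(probs[("saw", "a")])  # "a" follows "saw" in 2 of 2 cases -> 1.0
print(probs[("a", "van")])  # "van" follows "a" in 1 of 2 cases -> 0.5
```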
Introduction - Perplexity
Lower is better!
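Perplexity is the exponentiated average negative log-probability the model assigns to each word; a quick sketch:

```python
import math

def perplexity(log_probs):
    """Perplexity over a sequence: exp(-mean of per-word log-probabilities)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that gives every word probability 1/4 has perplexity 4:
print(perplexity([math.log(0.25)] * 10))  # ~4.0
```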
Introduction - WER (Word Error Rate)
Lower is better!
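WER counts the substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length, i.e. a word-level Levenshtein distance; a minimal sketch using the slide's own example:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = ref.split(), hyp.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(wer("I saw a van", "eyes awe of an"))  # 4 edits / 4 words = 1.0
```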
• Recurrent Neural Network based language model (RNN-LM) outperforms standard backoff N-gram models
• Words are projected into a low-dimensional space, where similar words are automatically clustered together.
• Smoothing is solved implicitly.
• Backpropagation is used for training.
Overview
(Figure: fixed-length context)
• Input layer x
• Hidden/context layer s
• Output layer y
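A minimal NumPy sketch of one step of this network, per the paper's simple recurrent ("Elman") architecture: s(t) = sigmoid(U·w(t) + W·s(t-1)) and y(t) = softmax(V·s(t)), with the recurrent weights written here as a separate matrix; all sizes and initial weights are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 5                      # toy vocabulary size and hidden size
U = rng.normal(0, 0.1, (H, V))    # word -> hidden weights
W = rng.normal(0, 0.1, (H, H))    # hidden -> hidden (recurrent) weights
Vo = rng.normal(0, 0.1, (V, H))   # hidden -> output weights

def step(word_id, s_prev):
    """One time step: 1-of-V word + previous state -> new state, P(next word)."""
    x = np.zeros(V)
    x[word_id] = 1.0                                 # 1-of-V word encoding
    s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s_prev)))  # sigmoid hidden/context layer
    z = Vo @ s
    y = np.exp(z - z.max())
    return s, y / y.sum()                            # softmax output distribution

s = np.zeros(H)                                      # empty initial context
for w in [3, 1, 4]:                                  # a toy word-id sequence
    s, y = step(w, s)
print(y.sum())  # a valid distribution over the vocabulary: sums to 1
```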
Model Description - RNN Cont'd
• An RNN can be seen as a chain of NNs.
• Intimately related to sequences and lists.
• In the last few years, RNNs have been successfully applied to: speech recognition, language modeling, translation, image captioning…
RNN vs. FF
• Parameters to tune or select:
• RNN
• Size of hidden layer
• FF
• size of layer that projects words to low dimensional space
• size of hidden layer
• size of context-length
RNN vs. FF
• In feedforward networks, history is represented by a context of N-1 words, so it is limited in the same way as in N-gram backoff models.
• In recurrent networks, history is represented by neurons with recurrent connections, so the history length is unlimited.
• Recurrent networks can also learn to compress the whole history into a low-dimensional space, while feedforward networks compress (project) just a single word.
• Recurrent networks can form a short-term memory, so they can deal better with position invariance; feedforward networks cannot.
Comparison of models - simple experiment on 4M words from the Switchboard corpus (PPL, lower is better):
• KN 5-gram (baseline): 93.7
• FF: 85.1
• RNN: 80
• 4×RNN + KN5: 73.5
Model setting
• Standard backpropagation algorithm + SGD
• Train in several epochs:
• start with α = 0.1
• if the log-likelihood of the validation data increases, continue
• else set α = 0.5·α and continue
• terminate if there is no significant improvement
• Convergence is usually reached in 10-20 epochs
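The schedule above can be replayed over a list of per-epoch validation log-likelihoods; this is a sketch of the control flow only (the improvement test is a plain comparison here):

```python
def lr_schedule(valid_lls, alpha=0.1):
    """Return the learning rate used in each epoch: keep alpha while validation
    log-likelihood improves, then halve it every epoch; stop once even halving
    no longer helps."""
    alphas, best, halving = [], float("-inf"), False
    for ll in valid_lls:
        alphas.append(alpha)
        if ll > best:
            best = ll
        elif halving:
            break               # no improvement even after halving: terminate
        else:
            halving = True      # first stall: start halving
        if halving:
            alpha *= 0.5
    return alphas

print(lr_schedule([-100, -90, -85, -85, -84.9, -84.9]))
# [0.1, 0.1, 0.1, 0.1, 0.05, 0.025]
```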
Model setting - Optimization
• Rare token: merge all words occurring less often than a threshold in the training data into a single, uniformly distributed rare token.
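A short sketch of this merging step; the threshold and data are illustrative:

```python
from collections import Counter

def merge_rare(sentences, threshold=2):
    """Map every word seen fewer than `threshold` times in training to one
    <rare> token; the paper spreads the rare token's probability uniformly
    over the merged words."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= threshold else "<rare>" for w in sent]
            for sent in sentences]

data = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(merge_rare(data))  # [['the', '<rare>', 'sat'], ['the', '<rare>', 'sat']]
```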
Experiments
• WSJ (source: read text only)
• training corpus consists of 37M words
• baseline KN5 - modified Kneser-Ney smoothed 5-gram
• RNN LM - trained on a selected subset of 6.4M words (300K sentences)
• combined model - 0.75·RNN + 0.25·backoff
• NIST RT05 (115 hours of meeting speech + web data)
• more than 1.3G words
• RNN LM - trained on a selected 5.4M words
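The 0.75/0.25 combination above is plain linear interpolation of the two models' per-word probabilities:

```python
def interpolate(p_rnn, p_backoff, lam=0.75):
    """P(w | h) = lam * P_rnn(w | h) + (1 - lam) * P_backoff(w | h)."""
    return lam * p_rnn + (1.0 - lam) * p_backoff

print(interpolate(0.02, 0.01))  # 0.75 * 0.02 + 0.25 * 0.01 ~ 0.0175
```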
Best perplexity result is 112, for a mixture of static and dynamic RNN LMs with a larger learning rate of 0.3.
~50% perplexity reduction; ~18% WER reduction
12% improvement
• RNNs are trained only on in-domain data (5.4M words)
• The RT05 and RT09 baseline LMs are trained on more than 1.3G words
Summary
• RNN LM is simple and intelligent.
• RNN LMs can be competitive with backoff LMs that are trained on much more data.
• Results show interesting improvements both for ASR and MT.
• A simple toolkit has been developed that can be used to train RNN LMs.
• This work provides a clear connection between machine learning, data compression, and language modeling.
Future work
• Clustering of vocabulary to speed up training
• Parallel implementation of neural network training algorithm
• Online learning and dynamic models
• BPTT algorithm for a lot of training data
• Go beyond BPTT? LSTM
• Extended to OCR, data compression, cognitive sciences…
–Xuening
Thanks!