Source: web.cs.ucdavis.edu/~yjlee/teaching/ecs289g-fall2015/vicky2.pdf
Recurrent Neural Network Based Language Model
Authors: Tomáš Mikolov et al., Brno University of Technology & Johns Hopkins University, USA
Presented by : Vicky Xuening Wang
ECS 289G, Nov 2015, UC Davis
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan “Honza” Černocký, Sanjeev Khudanpur
Language Model Tasks
• Statistical/Probabilistic Language Models
• Goal: compute the probability of a sentence or sequence of words:
• P(W) = P(w1, w2, w3, …, wn)
• Related task: predict the probability of an upcoming word:
• P(wn | w1, w2, …, wn-1)
Introduction - Language model
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
• Chain rule of probability
• Markov assumption
• N-gram model
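The three bullets above fit together in two lines: the chain rule factorizes the sentence probability exactly, and the Markov assumption truncates each history to the last N-1 words, which gives the N-gram model:

```latex
P(w_1,\dots,w_n) = \prod_{i=1}^{n} P(w_i \mid w_1,\dots,w_{i-1})
\quad \text{(chain rule)}

P(w_i \mid w_1,\dots,w_{i-1}) \approx P(w_i \mid w_{i-N+1},\dots,w_{i-1})
\quad \text{(Markov assumption, order } N\text{)}
```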
Introduction
• Typical tasks:
• Machine Translation: P(high winds tonite) > P(large winds tonite)
• Spell Correction: P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition: P(I saw a van) >> P(eyes awe of an)
• Summarization, question answering, etc.
Introduction - LM tasks
Introduction - Bigram model
Maximum Likelihood Estimation: P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
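A minimal sketch of this bigram MLE from raw counts; the toy sentences and the <s>/</s> boundary markers are illustrative choices, not from the slides:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) as count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words[:-1])                # histories only
        bigrams.update(zip(words[:-1], words[1:]))
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = train_bigram_mle([["I", "saw", "a", "van"], ["I", "saw", "a", "cat"]])
print(probs[("saw", "a")])  # "a" follows "saw" in 2 of 2 cases -> 1.0
print(probs[("a", "van")])  # "van" follows "a" in 1 of 2 cases -> 0.5
```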
Introduction - Perplexity
Lower is better!
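Perplexity is the exponentiated average negative log-probability the model assigns to each word; a quick sketch:

```python
import math

def perplexity(log_probs):
    """Perplexity over a sequence: exp(-mean of per-word log-probabilities)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that gives every word probability 1/4 has perplexity 4:
print(perplexity([math.log(0.25)] * 10))  # ~4.0
```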
Introduction - WER (Word Error Rate)
Lower is better!
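WER counts the substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length, i.e. a word-level Levenshtein distance; a minimal sketch using the slide's own example:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = ref.split(), hyp.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(wer("I saw a van", "eyes awe of an"))  # 4 edits / 4 words = 1.0
```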
• Recurrent Neural Network based language model (RNN-LM) outperforms standard backoff N-gram models
• Words are projected into a low-dimensional space, where similar words are automatically clustered together.
• Smoothing is solved implicitly.
• Backpropagation is used for training.
Overview
(Figure: fixed-length context)
• Input layer x
• Hidden/context layer s
• Output layer y
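A minimal NumPy sketch of one step of this network, per the paper's simple recurrent ("Elman") architecture: s(t) = sigmoid(U·w(t) + W·s(t-1)) and y(t) = softmax(V·s(t)), with the recurrent weights written here as a separate matrix; all sizes and initial weights are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 5                      # toy vocabulary size and hidden size
U = rng.normal(0, 0.1, (H, V))    # word -> hidden weights
W = rng.normal(0, 0.1, (H, H))    # hidden -> hidden (recurrent) weights
Vo = rng.normal(0, 0.1, (V, H))   # hidden -> output weights

def step(word_id, s_prev):
    """One time step: 1-of-V word + previous state -> new state, P(next word)."""
    x = np.zeros(V)
    x[word_id] = 1.0                                 # 1-of-V word encoding
    s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s_prev)))  # sigmoid hidden/context layer
    z = Vo @ s
    y = np.exp(z - z.max())
    return s, y / y.sum()                            # softmax output distribution

s = np.zeros(H)                                      # empty initial context
for w in [3, 1, 4]:                                  # a toy word-id sequence
    s, y = step(w, s)
print(y.sum())  # a valid distribution over the vocabulary: sums to 1
```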
Model Description - RNN Cont'd
• An RNN can be seen as a chain of NNs.
• Intimately related to sequences and lists.
• In the last few years, RNNs have been successfully applied to: speech recognition, language modeling, translation, image captioning…
RNN vs. FF
• Parameters to tune or select:
• RNN
• Size of hidden layer
• FF
• size of layer that projects words to low dimensional space
• size of hidden layer
• size of context-length
RNN vs. FF
• In feedforward networks, history is represented by a context of N-1 words, so it is limited in the same way as in N-gram backoff models.
• In recurrent networks, history is represented by neurons with recurrent connections, so the history length is unlimited.
• Recurrent networks can also learn to compress the whole history into a low-dimensional space, while feedforward networks compress (project) just a single word.
• Recurrent networks can form a short-term memory, so they can deal better with position invariance; feedforward networks cannot.
Comparison of models - simple experiment on 4M words from the Switchboard corpus (PPL, lower is better):
• KN 5-gram (baseline): 93.7
• FF: 85.1
• RNN: 80
• 4×RNN + KN5: 73.5
Model setting
• Standard backpropagation algorithm + SGD
• Train in several epochs:
• start with α = 0.1
• if the log-likelihood of the validation data increases, continue
• else set α = 0.5·α and continue
• terminate if there is no significant improvement
• Convergence is usually reached in 10-20 epochs
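The schedule above can be replayed over a list of per-epoch validation log-likelihoods; this is a sketch of the control flow only (the improvement test is a plain comparison here):

```python
def lr_schedule(valid_lls, alpha=0.1):
    """Return the learning rate used in each epoch: keep alpha while validation
    log-likelihood improves, then halve it every epoch; stop once even halving
    no longer helps."""
    alphas, best, halving = [], float("-inf"), False
    for ll in valid_lls:
        alphas.append(alpha)
        if ll > best:
            best = ll
        elif halving:
            break               # no improvement even after halving: terminate
        else:
            halving = True      # first stall: start halving
        if halving:
            alpha *= 0.5
    return alphas

print(lr_schedule([-100, -90, -85, -85, -84.9, -84.9]))
# [0.1, 0.1, 0.1, 0.1, 0.05, 0.025]
```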
Model setting - Optimization
• Rare token: merge all words occurring less often than a threshold in the training data into a single, uniformly distributed rare token.
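A short sketch of this merging step; the threshold and data are illustrative:

```python
from collections import Counter

def merge_rare(sentences, threshold=2):
    """Map every word seen fewer than `threshold` times in training to one
    <rare> token; the paper spreads the rare token's probability uniformly
    over the merged words."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= threshold else "<rare>" for w in sent]
            for sent in sentences]

data = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(merge_rare(data))  # [['the', '<rare>', 'sat'], ['the', '<rare>', 'sat']]
```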
Experiments
• WSJ (source: read text only)
• training corpus consists of 37M words
• baseline KN5 - modified Kneser-Ney smoothed 5-gram
• RNN LM - trained on a selected subset of 6.4M words (300K sentences)
• combined model - 0.75·RNN + 0.25·backoff
• NIST RT05 (115 hours of meeting speech + web data)
• more than 1.3G words
• RNN LM - trained on a selected 5.4M words
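The 0.75/0.25 combination above is plain linear interpolation of the two models' per-word probabilities:

```python
def interpolate(p_rnn, p_backoff, lam=0.75):
    """P(w | h) = lam * P_rnn(w | h) + (1 - lam) * P_backoff(w | h)."""
    return lam * p_rnn + (1.0 - lam) * p_backoff

print(interpolate(0.02, 0.01))  # 0.75 * 0.02 + 0.25 * 0.01 ~ 0.0175
```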
Best perplexity result is 112, for a mixture of static and dynamic RNN LMs with a larger learning rate of 0.3.
~50% perplexity reduction; ~18% WER reduction
12% improvement
• RNNs are trained only on in-domain data (5.4M words)
• The RT05 and RT09 baseline LMs are trained on more than 1.3G words
Summary
• RNN LM is simple and intelligent.
• RNN LMs can be competitive with backoff LMs that are trained on much more data.
• Results show interesting improvements both for ASR and MT.
• A simple toolkit has been developed that can be used to train RNN LMs.
• This work provides a clear connection between machine learning, data compression, and language modeling.
Future work
• Clustering of vocabulary to speed up training
• Parallel implementation of neural network training algorithm
• Online learning and dynamic models
• BPTT algorithm for a lot of training data
• Go beyond BPTT? LSTM
• Extended to OCR, data compression, cognitive sciences…
–Xuening
Thanks!