Language Modeling with Deep Transformers
Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
INTERSPEECH 2019, Graz, Austria. Neural Networks for Language Modeling [Thu-O-10-1], September 19, 2019
Introduction
• 2017: Advent of Transformer [Vaswani & Shazeer+ 17] in NLP/beyond.
• Originally an encoder-decoder model for machine translation.
• Decoder component: language model
– Early work in text generation (5 layers) [Liu & Saleh+ 18] ICLR 2018
• Gain in popularity more recently:
– Google 64-layer Transformer character LM [Al-Rfou & Choe+ 19] AAAI 2019
– OpenAI GPT-2 LM (48 layers) [Radford & Wu+ 19] Blog February 2019
• Large scale language model pre-training at the center of interest in NLP.
– Nvidia, Megatron LM (72 layers) Blog August 2019
– Salesforce, Controllable Transformer LM (48 layers) Last week!
2 of 14 Language Modeling with Deep Transformers — INTERSPEECH 2019, Graz, Austria, Sep. 15, 2019
Contributions of this work
• Application of Transformer language models to ASR
– Successful training of deep and powerful Transformer language models.
– Evaluation in both hybrid and attention-based end-to-end ASR.
– Large improvements over the state-of-the-art LSTM LM.
• Comprehensive hyper-parameter tuning
– Crucial for studying a new model.
– In particular for Transformers, which have lots of hyper-parameters.
• Demonstration of an LM specific property of Transformers
– LM task automatically provides positional information: No need for extra signal.
• Analysis and visualization
• Release of model configurations and checkpoints (link in the paper):
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers
Open-source toolkit RETURNN [Zeyer & Alkhouli+ 18]
Transformer Language Model
• Stack L layers, each consisting of self-attention and feed-forward modules.
• Apply residual connections and layer normalization across modules.
• Self-attention typically has multiple attention heads.
[Figure: one Transformer layer (Self-Attention, LayerNorm, Feed-Forward, LayerNorm), with positional encoding added at the input.]
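The layer stack above can be sketched in a few lines of NumPy; a minimal, simplified version of one decoder layer with causal (autoregressive) masking and post-norm residual connections (no dropout, no biases; all names and shapes are illustrative, not the released configurations):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # x: (T, d_res); split d_res into n_heads heads of size d_k each.
    T, d = x.shape
    d_k = d // n_heads
    q = (x @ Wq).reshape(T, n_heads, d_k).transpose(1, 0, 2)  # (H, T, d_k)
    k = (x @ Wk).reshape(T, n_heads, d_k).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, d_k).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)          # (H, T, T)
    # Causal mask: position t may only attend to positions <= t.
    scores[:, np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                             # row-wise softmax
    return (w @ v).transpose(1, 0, 2).reshape(T, d) @ Wo

def transformer_lm_layer(x, attn_params, ff_params, n_heads=8):
    # Self-attention module, then feed-forward module, each wrapped with
    # a residual connection and layer normalization (post-norm variant).
    x = layer_norm(x + causal_self_attention(x, *attn_params, n_heads))
    W1, W2 = ff_params  # shapes (d_res, d_ff) and (d_ff, d_res)
    return layer_norm(x + np.maximum(x @ W1, 0.0) @ W2)  # ReLU feed-forward
```

The causal mask is what makes this a language-model layer: the output at position t depends only on inputs up to t.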
Experimental Setups
LibriSpeech dataset [Panayotov & Chen+ 15].
• 960h audio, read speech transcriptions.
• Large LM task: 200K vocab, 800M-word extra textual training data.
Language modeling for speech recognition in 2 settings:
• Word-level models for the conventional hybrid HMM/NN system by lattice rescoring [Sundermeyer & Tüske+ 14]: push-forwarding Transformer states instead of LSTM states.
• BPE subword-level models for the end-to-end attention-based system by shallow fusion [Gulcehre & Firat+ 17, Toshniwal & Kannan+ 18].
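Shallow fusion simply adds a scaled external-LM log-probability to the end-to-end model's score when ranking beam hypotheses. A minimal sketch (the weight value and hypothesis format are illustrative, not the paper's exact setup):

```python
def fused_score(asr_logp, lm_logp, lm_weight=0.3):
    # Shallow fusion: log p_ASR + lambda * log p_LM.
    # lm_weight (lambda) is tuned on the dev set; 0.3 is just an example.
    return asr_logp + lm_weight * lm_logp

def best_hypothesis(hyps, lm_weight=0.3):
    # hyps: list of (tokens, asr_logp, lm_logp) beam candidates.
    return max(hyps, key=lambda h: fused_score(h[1], h[2], lm_weight))

# Toy example: the LM score flips which hypothesis wins.
hyps = [("a b", -4.0, -6.0), ("a c", -4.2, -3.0)]
print(best_hypothesis(hyps)[0])  # -> "a c"
```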
Intensive tuning of the baseline LSTM LM [Sundermeyer & Schlüter+ 12]:
• All tuning details provided in the paper.
• A wide model gave the best results: 2 layers with 4096 LSTM nodes.
• Relative improvement in PPL over the 4-gram of about 58%.
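For reference, perplexity and the relative improvements quoted throughout are computed as in this small self-contained sketch (the 146 and 60 are the dev perplexities from the final results table later in the talk):

```python
import math

def perplexity(logprobs):
    # PPL = exp(average negative log-likelihood per word), natural log.
    return math.exp(-sum(logprobs) / len(logprobs))

def rel_improvement(baseline_ppl, model_ppl):
    # Relative perplexity improvement in percent.
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl

# Dev perplexities from the final results: 4-gram 146, LSTM 60.
print(round(rel_improvement(146, 60)))  # about the ~58% quoted above
```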
Optimization of Transformer models
The exhaustive list of hyper-parameters is long:
• Number of layers & dimension of the residual connection.
• (Dimension of input word embeddings.)
• For each layer: number of attention heads, dimension of the key and query, dimension of the value, and dimension of the feed-forward layer.

To reduce this complexity:
• Use the same dimension for key, query, value, and the residual connection.
• Use the same dimensionality across all layers.
4 hyper-parameters describe all our models:
• Number of layers L.
• Feed-forward dimension dff.
• Residual dimension dres.
• Number of attention heads H.
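This four-parameter description fits in a small config object. As a sanity check, a rough parameter count (ignoring biases, layer norm, and the separate input-embedding dimension; an approximation, not the exact bookkeeping of the paper) lands close to the reported model sizes:

```python
from dataclasses import dataclass

@dataclass
class TransformerLMConfig:
    # The four hyper-parameters used in the talk to describe every model.
    n_layers: int          # L
    d_ff: int              # feed-forward dimension dff
    d_res: int             # residual (model) dimension dres
    n_heads: int           # H
    vocab_size: int = 200_000  # LibriSpeech word-level vocabulary

    def approx_params(self) -> int:
        # Input + output embeddings, plus per layer: 4*d_res^2 for the
        # self-attention projections and 2*d_res*d_ff for the FF module.
        emb = 2 * self.vocab_size * self.d_res
        per_layer = 4 * self.d_res**2 + 2 * self.d_res * self.d_ff
        return emb + self.n_layers * per_layer

cfg = TransformerLMConfig(n_layers=24, d_ff=2048, d_res=512, n_heads=8)
print(round(cfg.approx_params() / 1e6))  # roughly the 281 M reported for this model
```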
Effect of Depth and Width (Highlight)
Perplexity after 2.5 epochs (H = 8, dres = 512).
  L     dff    Params (M)   Train PPL   Dev PPL
 12   2,048       243          67.6       67.1
 24   2,048       281          62.2       62.3
 42   2,048       338          59.0       59.6
  6   8,192       262          66.7       66.7
 12   4,096       268          63.5       63.8
  4  16,384       277          67.6       67.4
  4  32,768       344          65.4       68.4
• For a given parameter budget, deep models tend to perform better.
Full tuning details in the paper!
• Effect of number of heads: helps up to 16, but 8 is already good.
• Effect of activation (ReLU, GeLU, GLU): the standard ReLU is fine!
• Parameter tying (Universal Transformers): improvements w/o extra params!
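Parameter tying in the Universal Transformer sense means reusing one set of layer weights at every depth, so depth stops multiplying the parameter count. A generic sketch (`layer_fn` stands for any single-layer forward function; the toy usage is purely illustrative):

```python
def tied_forward(x, layer_fn, shared_params, n_layers):
    # Apply the SAME layer parameters at every depth (weight sharing
    # across layers, as in Universal Transformers).
    for _ in range(n_layers):
        x = layer_fn(x, shared_params)
    return x

# Toy usage: each "layer" adds its (shared) parameter to the input.
print(tied_forward(0, lambda x, p: x + p, 2, n_layers=5))  # -> 10
```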
Optimization of Transformer models: Final results
Further scaling up:
Best model: 96-layer model (L = 96, dff = 2048, dres = 512, H = 8)
(112-layer model even got slightly better after camera-ready deadline.)
Final perplexity on LibriSpeech 200K vocab word level.
LM            Params (M)   Dev PPL   Test PPL
4-gram           230         146       152
LSTM            1048          60        63
Transformer      431          54        56
Large improvements over the highly optimized LSTM LM:
• About 11% relative improvements in PPL.
Do we need extra positional encoding in Transformer LMs?
• The amount of information increases at each time step in LM: a position signal?
• Our finding: external positional encoding is unnecessary.
– Even slight improvements in perplexity w/o positional encoding.
• Attention in the first layer (all 8 heads per target word position shown)
[Figure: first-layer attention weights (all 8 heads per target word position) over the example sentence "<bos> so they went on to the verandah and looked down upon the lights of the prison and listened to the sea lapping the shore <eos>". Two panels: with positional encoding, and without positional encoding. Axes: target word vs. input word.]
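The external positional encoding in question is the standard sinusoidal scheme of [Vaswani & Shazeer+ 17], added to the input word embeddings; a minimal NumPy sketch (assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                  # (T, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (T, d/2)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Dropping this addition at the input is what "without positional encoding" means in the comparison above.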
Other Layers: 3 categories
Analysis of the 24-layer model; the observations also hold for deeper models.
• There are 3 functional groups of layers.
[Figure: attention weights from three layers of the 24-layer model over the same example sentence, one panel per functional group described below.]
• Bottom layers (2-3): "Blur". ≈ Average over all positions; bag-of-words, global info. Some heads focus on difficult words, here "verandah".
• Mid layers (4-9): "Window". Focus on the local n-gram.
• Top layers (10-24): "Structured". Attend to some specific patterns; feature detectors.
Speech Recognition Experiments: Conventional Hybrid System
WERs (%) for hybrid systems on LibriSpeech 960h.
• The first-pass decoding generates lattices.
• Rescore the lattices (denoted by →) with the LSTM or Transformer (Trafo) LM.
Language Model    Params   dev-clean    dev-other    test-clean   test-other
                  (M)      PPL   WER    PPL   WER    PPL   WER    PPL   WER
4-gram            230      152   3.4    141   8.3    158   3.8    146   8.8
 → LSTM           1048      60   2.3     60   5.4     65   2.6     62   5.9
 → Transformer    431       53   2.1     54   5.2     58   2.5     55   5.6
LSTM → Trafo      -          -   1.9      -   4.5      -   2.3      -   5.0
Large improvements over the highly optimized LSTM LM:
• 10% relative improvement in PPL translates to
• 4% to 10% relative improvement in WER.
Defines new state-of-the-art results [Lüscher & Beck+ 19] on LibriSpeech 960h.
Speech Recognition Experiments: Attention-based System
WERs (%) for attention-based models on the LibriSpeech 960h dataset.
Perplexities are on the 10K BPE level.

Language Model   Beam   dev-clean    dev-other    test-clean   test-other
                        PPL   WER    PPL   WER    PPL   WER    PPL   WER
None             12      -    4.3     -    12.9    -    4.4     -    13.5
LSTM             64     44    2.9    46    8.9    47    3.2    47    9.9
Transformer      64     36    2.6    39    8.4    39    2.8    39    9.3
• Follow [Hannun & Lee+ 19] (Interspeech 2019): larger beam size and end-of-sentence penalty.
• Again, large improvements over the LSTM baseline.
• Best reported WERs for E2E systems w/o data augmentation, e.g. SpecAugment [Park & Chan+ 19] (Interspeech 2019).
• Available on: https://github.com/rwth-i6/returnn-experiments
Conclusion
Summary
• Successfully trained deep Transformer LMs with excellent performance for ASR.
• Demonstrated that positional encoding is not needed for Transformer LMs.
• Visualized and identified hierarchical feature engineering inside Transformer language models, with links to some fundamental concepts in LM:
– n-grams, bag-of-words, and, in the top layers, max-entropy-model-style features (but data-driven)?
Future work
• Further scaling up (layer-wise training).
• Reduce memory requirements of Transformers.
• More study on scalability of Transformer vs. LSTM vs. amount of data.
• For LSTMs: deeper (and wider) models with residual connections and layer normalization, e.g., RNMT+ [Chen & Firat+ 18]?
Thank you for your attention.
Thanks to: Eugen Beck, Liuhui Deng, Christoph Lüscher, Arne Nix, Julian Schamper, and Wei Zhou
References
[Al-Rfou & Choe+ 19] R. Al-Rfou, D. Choe, N. Constant, M. Guo, L. Jones.
Character-level language modeling with deeper self-attention.
In Proc. AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, Jan. 2019.
[Chen & Firat+ 18] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, Z. Chen, Y. Wu, M. Hughes.
The best of both worlds: Combining recent advances in neural machine translation.
In Proc. ACL, pp. 76–86, Melbourne, Australia, July 2018.
[Gulcehre & Firat+ 17] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, Y. Bengio.
On using monolingual corpora in neural machine translation.
Computer Speech & Language, Vol. 45, pp. 137–148, Sept. 2017.
[Hannun & Lee+ 19] A. Hannun, A. Lee, Q. Xu, R. Collobert.
Sequence-to-sequence speech recognition with time-depth separable convolutions.
arXiv preprint arXiv:1904.02619, 2019.
[Liu & Saleh+ 18] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, Ł. Kaiser, N. Shazeer.
Generating wikipedia by summarizing long sequences.
In Int. Conf. on Learning Representations (ICLR), Vancouver, Canada, April 2018.
[Lüscher & Beck+ 19] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, H. Ney.
RWTH ASR systems for LibriSpeech: Hybrid vs Attention.
In Proc. Interspeech, Graz, Austria, Sept. 2019.
[Panayotov & Chen+ 15] V. Panayotov, G. Chen, D. Povey, S. Khudanpur.
LibriSpeech: an ASR corpus based on public domain audio books.
In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, South Brisbane, Queensland, Australia, April 2015.
[Park & Chan+ 19] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le.
SpecAugment: A simple data augmentation method for automatic speech recognition.
In Proc. Interspeech, Graz, Austria, 2019.
[Radford & Wu+ 19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever.
Language models are unsupervised multitask learners.
[Online]. Available: https://blog.openai.com/better-language-models/, 2019.
References
[Sundermeyer & Schlüter+ 12] M. Sundermeyer, R. Schlüter, H. Ney.
LSTM neural networks for language modeling.
In Proc. Interspeech, pp. 194–197, Portland, OR, USA, Sept. 2012.
[Sundermeyer & Tüske+ 14] M. Sundermeyer, Z. Tüske, R. Schlüter, H. Ney.
Lattice decoding and rescoring with long-span neural network language models.
In Proc. Interspeech, pp. 661–665, Singapore, Sept. 2014.
[Toshniwal & Kannan+ 18] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, K. Livescu.
A comparison of techniques for language model integration in encoder-decoder speech recognition.
In Proc. SLT, Athens, Greece, Dec. 2018.
[Vaswani & Shazeer+ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin.
Attention is all you need.
In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008. Long Beach, CA, USA, Dec. 2017.
[Zeyer & Alkhouli+ 18] A. Zeyer, T. Alkhouli, H. Ney.
RETURNN as a generic flexible neural toolkit with application to translation and speech recognition.
In Proc. ACL 2018, System Demonstrations, Melbourne, Australia, July 2018.