Language Modeling with Deep Transformers
Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
INTERSPEECH 2019, Graz, Austria. Neural Networks for Language Modeling [Thu-O-10-1], September 19, 2019
Introduction
• 2017: Advent of Transformer [Vaswani & Shazeer+ 17] in NLP/beyond.
• Originally an encoder-decoder model for machine translation.
• Decoder component: language model
– Early work in text generation (5 layers) [Liu & Saleh+ 18] ICLR 2018
• Gain in popularity more recently:
– Google 64-layer Transformer character LM [Al-Rfou & Choe+ 19] AAAI 2019
– OpenAI GPT-2 LM (48 layers) [Radford & Wu+ 19] Blog February 2019
• Large scale language model pre-training at the center of interest in NLP.
– Nvidia, Megatron LM (72 layers) Blog August 2019
– Salesforce, Controllable Transformer LM (48 layers) Last week!
2 of 14 Language Modeling with Deep Transformers — INTERSPEECH 2019, Graz, Austria, Sep. 15, 2019
Contributions of this work
• Application of Transformer language models to ASR
– Successful training of deep and powerful Transformer language models.
– Evaluation in both hybrid and attention-based end-to-end ASR.
– Large improvements over the state-of-the-art LSTM LM.
• Comprehensive hyper-parameter tuning
– Crucial for studying a new model.
– In particular for Transformers, which have lots of hyper-parameters.
• Demonstration of an LM specific property of Transformers
– LM task automatically provides positional information: No need for extra signal.
• Analysis and visualization
• Release of model configurations and checkpoints (link in the paper):
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers
Open-source toolkit RETURNN [Zeyer & Alkhouli+ 18]
Transformer Language Model
• Stack L layers, each consisting of self-attention and feed-forward modules.
• Apply residual connections and layer normalization across modules.
• Self-attention typically has multiple attention heads.
[Figure: one Transformer layer (Self-Attention, LayerNorm, Feed-Forward, LayerNorm), with positional encoding added at the input.]
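The layer stack above can be sketched in a few lines of NumPy; a minimal, simplified version of one decoder layer with causal (autoregressive) masking and post-norm residual connections (no dropout, no biases; all names and shapes are illustrative, not the released configurations):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # x: (T, d_res); split d_res into n_heads heads of size d_k each.
    T, d = x.shape
    d_k = d // n_heads
    q = (x @ Wq).reshape(T, n_heads, d_k).transpose(1, 0, 2)  # (H, T, d_k)
    k = (x @ Wk).reshape(T, n_heads, d_k).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, d_k).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)          # (H, T, T)
    # Causal mask: position t may only attend to positions <= t.
    scores[:, np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                             # row-wise softmax
    return (w @ v).transpose(1, 0, 2).reshape(T, d) @ Wo

def transformer_lm_layer(x, attn_params, ff_params, n_heads=8):
    # Self-attention module, then feed-forward module, each wrapped with
    # a residual connection and layer normalization (post-norm variant).
    x = layer_norm(x + causal_self_attention(x, *attn_params, n_heads))
    W1, W2 = ff_params  # shapes (d_res, d_ff) and (d_ff, d_res)
    return layer_norm(x + np.maximum(x @ W1, 0.0) @ W2)  # ReLU feed-forward
```

The causal mask is what makes this a language-model layer: the output at position t depends only on inputs up to t.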
Experimental Setups
LibriSpeech dataset [Panayotov & Chen+ 15].
• 960h audio, read speech transcriptions.
• Large LM task: 200K vocab, 800M-word extra textual training data.
Language modeling for speech recognition in 2 settings:
• Word-level models for the conventional hybrid HMM/NN system by lattice rescoring [Sundermeyer & Tüske+ 14]: push-forwarding Transformer states instead of LSTM states.
• BPE subword-level models for the end-to-end attention-based system by shallow fusion [Gulcehre & Firat+ 17, Toshniwal & Kannan+ 18].
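Shallow fusion simply adds a scaled external-LM log-probability to the end-to-end model's score when ranking beam hypotheses. A minimal sketch (the weight value and hypothesis format are illustrative, not the paper's exact setup):

```python
def fused_score(asr_logp, lm_logp, lm_weight=0.3):
    # Shallow fusion: log p_ASR + lambda * log p_LM.
    # lm_weight (lambda) is tuned on the dev set; 0.3 is just an example.
    return asr_logp + lm_weight * lm_logp

def best_hypothesis(hyps, lm_weight=0.3):
    # hyps: list of (tokens, asr_logp, lm_logp) beam candidates.
    return max(hyps, key=lambda h: fused_score(h[1], h[2], lm_weight))

# Toy example: the LM score flips which hypothesis wins.
hyps = [("a b", -4.0, -6.0), ("a c", -4.2, -3.0)]
print(best_hypothesis(hyps)[0])  # -> "a c"
```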
Intensive tuning of the baseline LSTM LM [Sundermeyer & Schlüter+ 12]:
• All tuning details provided in the paper.
• A wide model gave the best results: 2 layers with 4096 LSTM nodes.
• Relative improvement in PPL over the 4-gram of about 58%.
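For reference, perplexity and the relative improvements quoted throughout are computed as in this small self-contained sketch (the 146 and 60 are the dev perplexities from the final results table later in the talk):

```python
import math

def perplexity(logprobs):
    # PPL = exp(average negative log-likelihood per word), natural log.
    return math.exp(-sum(logprobs) / len(logprobs))

def rel_improvement(baseline_ppl, model_ppl):
    # Relative perplexity improvement in percent.
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl

# Dev perplexities from the final results: 4-gram 146, LSTM 60.
print(round(rel_improvement(146, 60)))  # about the ~58% quoted above
```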
Optimization of Transformer models
The exhaustive list of hyper-parameters is long:
• Number of layers & dimension of the residual connection.
• (Dimension of input word embeddings.)
• For each layer: number of attention heads, dimension of the key and query, dimension of the value, and dimension of the feed-forward layer.

To reduce this complexity:
• Use the same dimension for key, query, value, and the residual connection.
• Use the same dimensionality across all layers.
4 hyper-parameters describe all our models:
• Number of layers L.
• Feed-forward dimension dff.
• Residual dimension dres.
• Number of attention heads H.
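This four-parameter description fits in a small config object. As a sanity check, a rough parameter count (ignoring biases, layer norm, and the separate input-embedding dimension; an approximation, not the exact bookkeeping of the paper) lands close to the reported model sizes:

```python
from dataclasses import dataclass

@dataclass
class TransformerLMConfig:
    # The four hyper-parameters used in the talk to describe every model.
    n_layers: int          # L
    d_ff: int              # feed-forward dimension dff
    d_res: int             # residual (model) dimension dres
    n_heads: int           # H
    vocab_size: int = 200_000  # LibriSpeech word-level vocabulary

    def approx_params(self) -> int:
        # Input + output embeddings, plus per layer: 4*d_res^2 for the
        # self-attention projections and 2*d_res*d_ff for the FF module.
        emb = 2 * self.vocab_size * self.d_res
        per_layer = 4 * self.d_res**2 + 2 * self.d_res * self.d_ff
        return emb + self.n_layers * per_layer

cfg = TransformerLMConfig(n_layers=24, d_ff=2048, d_res=512, n_heads=8)
print(round(cfg.approx_params() / 1e6))  # roughly the 281 M reported for this model
```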
Effect of Depth and Width (Highlight)
Perplexity after 2.5 epochs (H = 8, dres = 512).
  L     dff    Params (M)   Train PPL   Dev PPL
 12   2,048       243          67.6       67.1
 24   2,048       281          62.2       62.3
 42   2,048       338          59.0       59.6
  6   8,192       262          66.7       66.7
 12   4,096       268          63.5       63.8
  4  16,384       277          67.6       67.4
  4  32,768       344          65.4       68.4
• For a given parameter budget, deep models tend to perform better.
Full tuning details in the paper!
• Effect of number of heads: helps up to 16, but 8 is already good.
• Effect of activation (ReLU, GeLU, GLU): the standard ReLU is fine!
• Parameter tying (Universal Transformers): improvements w/o extra params!
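Parameter tying in the Universal Transformer sense means reusing one set of layer weights at every depth, so depth stops multiplying the parameter count. A generic sketch (`layer_fn` stands for any single-layer forward function; the toy usage is purely illustrative):

```python
def tied_forward(x, layer_fn, shared_params, n_layers):
    # Apply the SAME layer parameters at every depth (weight sharing
    # across layers, as in Universal Transformers).
    for _ in range(n_layers):
        x = layer_fn(x, shared_params)
    return x

# Toy usage: each "layer" adds its (shared) parameter to the input.
print(tied_forward(0, lambda x, p: x + p, 2, n_layers=5))  # -> 10
```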
Optimization of Transformer models: Final results
Further scaling up:
Best model: 96-layer model (L = 96, dff = 2048, dres = 512, H = 8)
(112-layer model even got slightly better after camera-ready deadline.)
Final perplexity on LibriSpeech 200K vocab word level.
LM            Params (M)   Dev PPL   Test PPL
4-gram           230         146       152
LSTM            1048          60        63
Transformer      431          54        56
Large improvements over the highly optimized LSTM LM:
• About 11% relative improvements in PPL.
Do we need extra positional encoding in Transformer LMs?
• The amount of information increases at each time step in LM: a position signal?
• Our finding: external positional encoding is unnecessary.
– Even slight improvements in perplexity w/o positional encoding.
• Attention in the first layer (all 8 heads per target word position shown)
[Figure: first-layer attention weights (all 8 heads per target word position) over the example sentence "<bos> so they went on to the verandah and looked down upon the lights of the prison and listened to the sea lapping the shore <eos>". Two panels: with positional encoding, and without positional encoding. Axes: target word vs. input word.]
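The external positional encoding in question is the standard sinusoidal scheme of [Vaswani & Shazeer+ 17], added to the input word embeddings; a minimal NumPy sketch (assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                  # (T, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (T, d/2)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Dropping this addition at the input is what "without positional encoding" means in the comparison above.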
Other Layers: 3 categories
Analysis of the 24-layer model; the observations also hold for deeper models.
• There are 3 functional groups of layers.
[Figure: attention weights from three layers of the 24-layer model over the same example sentence, one panel per functional group described below.]
• Bottom layers (2-3): "Blur". ≈ Average over all positions; bag-of-words, global info. Some heads focus on difficult words, here "verandah".
• Mid layers (4-9): "Window". Focus on the local n-gram.
• Top layers (10-24): "Structured". Attend to some specific patterns; feature detectors.
Speech Recognition Experiments: Conventional Hybrid System
WERs (%) for hybrid systems on LibriSpeech 960h.
• The first-pass decoding generates lattices.
• Rescore the lattices (denoted by →) with the LSTM or Transformer (Trafo) LM.
Language Model    Params   dev-clean    dev-other    test-clean   test-other
                  (M)      PPL   WER    PPL   WER    PPL   WER    PPL   WER
4-gram            230      152   3.4    141   8.3    158   3.8    146   8.8
 → LSTM           1048      60   2.3     60   5.4     65   2.6     62   5.9
 → Transformer    431       53   2.1     54   5.2     58   2.5     55   5.6
LSTM → Trafo      -          -   1.9      -   4.5      -   2.3      -   5.0
Large improvements over the highly optimized LSTM LM:
• 10% relative improvement in PPL translates to
• 4% to 10% relative improvement in WER.
Defines new state-of-the-art results [Lüscher & Beck+ 19] on LibriSpeech 960h.
Speech Recognition Experiments: Attention-based System
WERs (%) for attention-based models on the LibriSpeech 960h dataset.
Perplexities are on the 10K BPE level.

Language Model   Beam   dev-clean    dev-other    test-clean   test-other
                        PPL   WER    PPL   WER    PPL   WER    PPL   WER
None             12      -    4.3     -    12.9    -    4.4     -    13.5
LSTM             64     44    2.9    46    8.9    47    3.2    47    9.9
Transformer      64     36    2.6    39    8.4    39    2.8    39    9.3
• Follow [Hannun & Lee+ 19] (Interspeech 2019): larger beam size and end-of-sentence penalty.
• Again, large improvements over the LSTM baseline.
• Best reported WERs for E2E systems w/o data augmentation, e.g. SpecAugment [Park & Chan+ 19] (Interspeech 2019).
• Available on: https://github.com/rwth-i6/returnn-experiments
Conclusion
Summary
• Successfully trained deep Transformer LMs with excellent performance for ASR.
• Demonstrated that positional encoding is not needed for Transformer LMs.
• Visualized and identified hierarchical feature engineering inside Transformer language models, with links to some fundamental concepts in LM:
– n-grams, bag-of-words, and, in the top layers, max-entropy-model-style features (but data-driven)?
Future work
• Further scaling up (layer-wise training).
• Reduce memory requirements of Transformers.
• More study on scalability of Transformer vs. LSTM vs. amount of data.
• For LSTMs: deeper (and wider) models with residual connections and layer normalization, e.g., RNMT+ [Chen & Firat+ 18]?
Thank you for your attention.
Thanks to: Eugen Beck, Liuhui Deng, Christoph Lüscher, Arne Nix, Julian Schamper, and Wei Zhou
References
[Al-Rfou & Choe+ 19] R. Al-Rfou, D. Choe, N. Constant, M. Guo, L. Jones.
Character-level language modeling with deeper self-attention.
In Proc. AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, Jan. 2019.
[Chen & Firat+ 18] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, Z. Chen, Y. Wu, M. Hughes.
The best of both worlds: Combining recent advances in neural machine translation.
In Proc. ACL, pp. 76–86, Melbourne, Australia, July 2018.
[Gulcehre & Firat+ 17] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, Y. Bengio.
On using monolingual corpora in neural machine translation.
Computer Speech & Language, Vol. 45, pp. 137–148, Sept. 2017.
[Hannun & Lee+ 19] A. Hannun, A. Lee, Q. Xu, R. Collobert.
Sequence-to-sequence speech recognition with time-depth separable convolutions.
arXiv preprint arXiv:1904.02619, 2019.
[Liu & Saleh+ 18] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, Ł. Kaiser, N. Shazeer.
Generating wikipedia by summarizing long sequences.
In Int. Conf. on Learning Representations (ICLR), Vancouver, Canada, April 2018.
[Lüscher & Beck+ 19] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, H. Ney.
RWTH ASR systems for LibriSpeech: Hybrid vs Attention.
In Proc. Interspeech, Graz, Austria, Sept. 2019.
[Panayotov & Chen+ 15] V. Panayotov, G. Chen, D. Povey, S. Khudanpur.
LibriSpeech: an ASR corpus based on public domain audio books.
In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, South Brisbane, Queensland, Australia, April 2015.
[Park & Chan+ 19] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le.
SpecAugment: A simple data augmentation method for automatic speech recognition.
In Proc. Interspeech, Graz, Austria, 2019.
[Radford & Wu+ 19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever.
Language models are unsupervised multitask learners.
[Online]. Available: https://blog.openai.com/better-language-models/, 2019.
References
[Sundermeyer & Schlüter+ 12] M. Sundermeyer, R. Schlüter, H. Ney.
LSTM neural networks for language modeling.
In Proc. Interspeech, pp. 194–197, Portland, OR, USA, Sept. 2012.
[Sundermeyer & Tüske+ 14] M. Sundermeyer, Z. Tüske, R. Schlüter, H. Ney.
Lattice decoding and rescoring with long-span neural network language models.
In Proc. Interspeech, pp. 661–665, Singapore, Sept. 2014.
[Toshniwal & Kannan+ 18] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, K. Livescu.
A comparison of techniques for language model integration in encoder-decoder speech recognition.
In Proc. SLT, Athens, Greece, Dec. 2018.
[Vaswani & Shazeer+ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin.
Attention is all you need.
In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008. Long Beach, CA, USA, Dec. 2017.
[Zeyer & Alkhouli+ 18] A. Zeyer, T. Alkhouli, H. Ney.
RETURNN as a generic flexible neural toolkit with application to translation and speech recognition.
In Proc. ACL 2018, System Demonstrations, Melbourne, Australia, July 2018.