TRANSCRIPT
A Tensorized Transformer for Language Modeling
Tianjin University
Xindian Ma, Peng Zhang*, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, Ming Zhou
NeurIPS 2019
Background
• Transformer has led to breakthroughs in natural language processing tasks.
• However, the size of Transformer, and of its variant BERT, limits the effective deployment of these models in resource-limited settings.
• Several compression methods have been proposed:
  • TT-embedding
  • BTRNN
  • Tensorizing Neural Networks
Some Compression Methods
• TT-Embedding [1]
  • Tensor-Train decomposition is used to compress the embedding layer (look-up table).
• BTRNN [2]
  • Block-term tensor decomposition is used to compress the input layers in LSTMs.
• Tensorizing Neural Networks [3]
  • Tensor-Train format is used to compress the fully-connected layers.
[1] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
[2] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9378–9387, 2018.
[3] A. Novikov, D. Podoprikhin, A. Osokin, et al. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
Problem Formulation
• The goals are:
  • To linearly represent self-attention by a group of basis vectors
  • To compress the multi-head attention in Transformer
  • After compression, the result can be directly integrated into the encoder and decoder framework of Transformer
Methods
Basic Ideas
• Low-rank decomposition
• Parameter sharing
• Tucker decomposition is used to construct Single-block attention.
• Block-term decomposition combined with parameter sharing is used to construct the multi-head mechanism (Multi-linear attention).
Transformer Language Modeling
• Scaled Dot-Product Attention
  • $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$
• Multi-head Attention
  • $\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)\,W^{O}$,
    where $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$ (a NumPy sketch is given below)
A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[Figure annotation: multi-head attention uses multi-group parameters.]
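For reference, here is a minimal NumPy sketch of the two standard operations above; the function and variable names are ours, not the paper's, and Q, K, V are assumed to have shape (n, d):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with Q, K, V of shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n, n) token-to-token scores
    return softmax(scores, axis=-1) @ V        # (n, d) output

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """Concat(head_1, ..., head_h) W^O with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V);
    WQ, WK, WV are lists holding one projection matrix per head."""
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO
```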
Linear Representation
• Theorem: Scaled Dot-Product Attention can be represented by a linear combination of a set of basis vectors:
  • $\mathrm{Attention}(Q, K, V) = (e_1, \cdots, e_n)\,M$
  • where $M \in \mathbb{R}^{n \times d}$ is a coefficient matrix.
Tucker Decomposition
[Figure: Tucker decomposition of a third-order tensor of size D1 × D2 × D3 into a core tensor 𝒢 (of size K1 × K2 × K3) multiplied along each mode by factor matrices, whose columns serve as the basis vectors.]
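To make the figure concrete, below is a small NumPy sketch that rebuilds a third-order tensor from a Tucker core and three factor matrices; all shapes, names and values here are illustrative assumptions rather than anything taken from the paper:

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """X[p, q, r] = sum_{i,j,m} G[i, j, m] * A[p, i] * B[q, j] * C[r, m]:
    the core tensor G is multiplied along each mode by a factor matrix."""
    return np.einsum('ijm,pi,qj,rm->pqr', G, A, B, C)

# Toy example: a (D1, D2, D3) tensor represented by a small (K1, K2, K3) core.
D1, D2, D3, K1, K2, K3 = 6, 5, 4, 2, 3, 2
rng = np.random.default_rng(0)
G = rng.standard_normal((K1, K2, K3))   # core tensor
A = rng.standard_normal((D1, K1))       # factor matrices; their columns act as basis vectors
B = rng.standard_normal((D2, K2))
C = rng.standard_normal((D3, K3))
X = tucker_reconstruct(G, A, B, C)      # reconstructed tensor of shape (6, 5, 4)
```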
Single-Block Attention
[Figure: the same Tucker structure (core tensor, factor matrices, basis vectors) used to build single-block attention.]
$\mathrm{Attention}_{TD}(\mathcal{G}; Q, K, V) = \mathcal{G} \times_1 Q \times_2 K \times_3 V = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{m=1}^{M} \mathcal{G}_{ijm}\, Q_i \circ K_j \circ V_m$
where $\circ$ is the outer product and $Q_i$, $K_j$, $V_m$ are the $i$-th, $j$-th and $m$-th columns of $Q$, $K$, $V$.
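A minimal NumPy sketch of this formula, assuming Q, K, V of shape (n, d) and a core tensor of shape (d, d, d); the linear projections and softmax normalization applied before this step (see the next figure) are omitted, and the function name is ours:

```python
import numpy as np

def single_block_attention(G, Q, K, V):
    """Attention_TD(G; Q, K, V) = G x1 Q x2 K x3 V.
    Element-wise: T[p, q, r] = sum_{i,j,m} G[i, j, m] * Q[p, i] * K[q, j] * V[r, m],
    i.e. each core entry G[i, j, m] weights the outer product of columns Q_i, K_j, V_m.
    With Q, K, V of shape (n, d) and core G of shape (d, d, d), T has shape (n, n, n)."""
    return np.einsum('ijm,pi,qj,rm->pqr', G, Q, K, V)
```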
Single-Block Attention in Transformer
[Figure: in the Transformer, Q, K and V are obtained through linear projections (W^q, W^k, W^v); together with a softmax normalization and the core tensor 𝒢 they produce the third-order single-block attention output.]
Lower-Rank Decomposition
• The core tensor 𝒢 is defined as follows:
  • $\mathcal{G}_{ijm} = \begin{cases} g_{i} \sim \mathrm{rand}(0, 1), & i = j = m \\ 0, & \text{otherwise} \end{cases}$
• In this case it can be set that $I = J = M = R$, where $R$ is the rank.
• The time complexity of single-block attention is $\mathcal{O}(N^{3})$; the time complexity of scaled dot-product attention is $\mathcal{O}(N^{2}d)$.
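A sketch of this diagonal core tensor, assuming the non-zero entries are simply drawn from rand(0, 1) as in the definition (whether they are further trained is not stated on the slide):

```python
import numpy as np

def diagonal_core(R, rng=None):
    """Core tensor with weights only on the super-diagonal:
    G[i, j, m] = g_i drawn from rand(0, 1) if i == j == m, and 0 otherwise."""
    rng = np.random.default_rng(0) if rng is None else rng
    G = np.zeros((R, R, R))
    idx = np.arange(R)
    G[idx, idx, idx] = rng.random(R)    # R values in [0, 1) on the super-diagonal
    return G
```

With this diagonal core, the triple sum in the single-block attention formula collapses to a single sum of $R$ rank-one terms, $\sum_{r=1}^{R} g_r\, Q_r \circ K_r \circ V_r$.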
Reconstruction of Scaled Dot-Product Attention
• Corollary: Scaled dot-product attention can be reconstructed from single-block attention:
  • $\mathrm{Attention}(Q, K, V)_{i,m} = \sum_{j=1}^{J} \mathrm{Attention}_{TD}(\mathcal{G}; Q, K, V)_{i,j,m}$
  • where $i$, $m$ and $j$ are the indices of the single-block attention output.
How to Get a Richer Representation
• Tensor split
• Matrices concat
[Figure: the third-order attention tensor is split into a set of matrices T_1, T_2, T_3, ..., which are then concatenated.]
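A small sketch of these two steps; the choice of which mode to split along is our assumption, since the slide does not specify it:

```python
import numpy as np

def split_and_concat(T):
    """Tensor split: slice the third-order attention tensor T along its second mode
    into matrices T_1, T_2, ...; matrices concat: join them along the feature axis."""
    slices = [T[:, j, :] for j in range(T.shape[1])]   # tensor split
    return np.concatenate(slices, axis=-1)             # matrices concat
```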
Multi-linear Attention by Block-term Decomposition
• It is important to construct a multi-head mechanism for modeling long-range dependencies.
• How can we design the model with a higher compression ratio?
  • 1) Block-term decomposition (method)
  • 2) Parameter sharing (idea)
Block-term Decomposition
Y. Chen, X. Jin, B. Kang, et al. Sharing residual units through collective tensor factorization to improve deep neural networks. In IJCAI, pages 635–641, 2018.
Multi-linear Attention by Block-term Decomposition
In order to construct the multi-head mechanism, Multi-linear attention can be formulated as follows:
$\mathrm{MultiLinearAttention}(\mathcal{G}; Q, K, V) = \mathrm{SplitConcat}\!\left(\frac{1}{h}\,(T_1 + \cdots + T_h)\right)W^{O}$
where $T_j = \mathrm{Attention}_{TD}(\mathcal{G}_j;\, QW^{q},\, KW^{k},\, VW^{v})$, and the projections $W^{q}, W^{k}, W^{v}$ are shared across the $h$ blocks.
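Putting the pieces together, here is a hedged end-to-end sketch of Multi-linear attention built from the helpers defined in the earlier sketches (single_block_attention, diagonal_core, split_and_concat are our illustrative names, and the shape of the output projection Wo is an assumption):

```python
import numpy as np

def multi_linear_attention(cores, Q, K, V, Wq, Wk, Wv, Wo):
    """Multi-linear attention sketch: h single-block attentions share one set of
    projections Wq, Wk, Wv (parameter sharing) and differ only in their small core
    tensors; their average is split, concatenated and projected by Wo."""
    Qp, Kp, Vp = Q @ Wq, K @ Wk, V @ Wv                              # shared projections
    blocks = [single_block_attention(G, Qp, Kp, Vp) for G in cores]  # one block per core
    avg = sum(blocks) / len(blocks)                                  # (1/h) * (T_1 + ... + T_h)
    return split_and_concat(avg) @ Wo                                # SplitConcat(.) W^O
```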
Parameter Sharing
Experimental Results in Language Modeling
WMT-16 English-to-German
Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891, 2016.
Conclusion
• We proposed a novel self-attention method, namely Multi-linear attention.
• The Tensorized Transformer model combines two compression ideas: parameter sharing and low-rank decomposition.
• Our method achieves a higher compression ratio and better experimental results in language modeling.
• With further optimization, the Tensorized Transformer model can be applied to more NLP tasks under limited resources.