
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 1–7, July 5 – July 10, 2020. ©2020 Association for Computational Linguistics


    Adaptive Transformers for Learning Multimodal Representations

    Prajjwal [email protected]

    Abstract

The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we extend adaptive approaches to learn more about model interpretability and computational efficiency. Specifically, we study attention spans, sparse, and structured dropout methods to help understand how their attention mechanism extends for vision and language tasks. We further show that these approaches can help us learn more about how the network perceives the complexity of input sequences, sparsity preferences for different modalities, and other related phenomena.

    1 Introduction

Learning richer representations from visual and text data is a central task in multi-modal learning. Attention-based methods have proven very useful in learning long-term dependencies and forming richer representations of input sequences. Numerous approaches (Lu et al., 2019; Su et al., 2019; Li et al., 2019; Chen et al., 2019) have been proposed for learning visiolinguistic representations with transformers. Although these approaches have provided significant improvements on various benchmarks (language and visiolinguistic), the architectures used are over-parameterized and require extensive training lasting several weeks, using multiple objectives to form a generalized representation of the task to be addressed, which is then followed by fine-tuning on a downstream task. This workflow has become a concerning problem: it makes deep learning methodologies inaccessible and increases carbon footprints (Strubell et al., 2019). In this work, we specifically explore adaptive methods.

We refer to adaptive mechanisms as those methods that change their behavior during training or at run time and adapt stochastically to the environment based on data heuristics (parameters) learned by encountering samples from the same data distribution, optimized by an objective function. Alternative approaches such as pruning, distillation (Hinton et al., 2015), and quantization are rigid to some extent and induce some form of permanent modification to the model. Adaptive methods instead push the network to learn parameters such that its behavior changes according to the complexity of the input sequence as perceived by the network. The code to reproduce the results in this work is publicly available at https://github.com/prajjwal1/adaptive_transformer.

Current self-attention approaches assume that the attention span of a head is invariant to the complexity of an input sequence. Attention heads can learn their optimal context size (Sukhbaatar et al., 2019), which results in a reduction of FLOPs. When an optimal attention span is learned, the amount of attention given to a particular input sequence by an attention head is determined by its context size. We show that the context size varies with the emergent complexity of the sequence, and that spans can help us understand how sensitive a layer is to an input sequence.

Training models with a quarter of a billion parameters is neither feasible nor practical for most users. One effective way to facilitate the scaling of neural networks is to make the weights of the network sparse. This configuration allows us to perform faster training of deeper networks with relatively less compute. To make attention distributions sparse, we use α-entmax (Correia et al., 2019) to obtain the probability distribution over weights. Normalized exponential functions like softmax cannot assign a zero attention weight. This property forces the context vector to stay dense, so non-relevant sequences are still considered even though the network has effectively discarded them by assigning them a very low weight. Adaptive sparsity can make an attention head learn richer distributions by letting the behavior of the distribution oscillate between softmax and sparsemax. We show that this behavior can help us understand preferences for the density of the attention weight distribution and how these preferences vary among heads for different modalities.

We also study a form of regularization method called LayerDrop (Fan et al., 2019) to understand its regularization impact on multi-modal features. If the network can learn to drop identical layers (Data Driven pruning), then it can be regarded as an adaptive depth mechanism. We specifically use the Every Other pruning method, where the user specifies the drop rate, because it offers maximal gains compared to its counterpart pruning methods. This method has proven effective in reducing the number of parameters and pruning layers during inference.

The contributions of this work are as follows:

• Adaptive approaches have so far been tested only with linguistic features. We extend these approaches to study how they align to capture complex relationships between different modalities. We also study the effects of combining these approaches to understand their compatibility through ablation analysis.

• We perform interpretability analysis to learn how these approaches can enhance our understanding of attention behavior and adaptive mechanisms.

• We provide experimental results on recent adaptive approaches for multi-modal input sequences.

    2 Background

    2.1 LXMERT

We use LXMERT (Tan and Bansal, 2019) as the baseline architecture. The adaptive approaches can be combined with any other self-attention-based transformer. LXMERT uses self- and cross-attention layers to jointly attend to image and text inputs (the input sequence). Specifically, it takes word-level sentence embeddings and object-level image embeddings. The encoder consists of three main components: a language encoder (9 layers) and a visual encoder (5 layers), which form the single-modality textual and image representations, and a cross-modality encoder (5 layers), which jointly attends to both of these representations. Cross attention is responsible for forming the mapping between ROI features and textual representations. Since the architecture used is identical, we refer the reader to Tan and Bansal (2019) for a detailed description of the pre-training strategies. The network used has been pre-trained on four objectives: Masked Cross-Modality LM, Masked Object Prediction, Cross-Modality Matching, and Image Question Answering. Faster R-CNN is used to extract ROI features from the input images.
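To make the layer configurations referred to throughout the paper (e.g., 9-5-5 vs. 10-6-6) explicit, the following is a minimal sketch of how they could be expressed in code; the class and field names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the layer configuration described
# above; names are illustrative only.
from dataclasses import dataclass

@dataclass
class LxmertLayerConfig:
    num_language_layers: int = 9   # single-modality language encoder
    num_vision_layers: int = 5     # single-modality visual encoder (ROI features)
    num_cross_layers: int = 5      # cross-modality encoder (joint attention)

config_9_5_5 = LxmertLayerConfig()
config_10_6_6 = LxmertLayerConfig(10, 6, 6)  # larger variant used in the LayerDrop experiments
```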

    2.2 Adaptive Attention Span

Unlike dynamic attention, which assumes that all attention heads require the same span, learning an optimal attention span enables gathering information according to the context size determined by each attention head. A maximum upper-bound span limit is enforced on each head, which helps reduce computation and memory requirements. As proposed in Sukhbaatar et al. (2019), different heads emphasize different contexts depending on the task being addressed. We explicitly show that these spans vary significantly based on the complexity of the task. We use the same masking function with minor modification:

$$ m_z(x) = \min\Big[\max\Big[\tfrac{1}{R}(R + z - x),\, 0\Big],\, 1\Big] \qquad (1) $$

Here, z acts as a model parameter. We initialize it with the Kaiming normal distribution (He et al., 2015). m_z is coupled with the attention weights. The hyperparameter R helps control the softness of this attention distribution.

The attention head computes the similarity between the current token t and a past token r in the span [t − S, t) as:

$$ s_{tr} = x_t^\top Q^\top (K x_r + P_{t-r}) \qquad (2) $$

where K, Q, and P_{t-r} denote the key vectors, query vectors, and position embeddings, respectively. In the standard setting, the attention weight distribution is obtained by applying softmax to the similarity vector:

$$ A_{tr} = \mathrm{softmax}(s_{tr}) \qquad (3) $$

Figure 1: Variation of adaptive spans in different attention layers (single- and cross-modality) as training progresses. Accuracy on the local validation set is reported per epoch. The maximum adaptive span limit was set to 1024. (Panels: Language (Self Attention), Language (Cross Attention), Vision+Language (Cross Attention), Vision (Cross Attention), Vision (Self Attention); axes: epoch vs. attention span.)

The attention weights from Equation 3 are then processed by the masking function as:

$$ A_{tr} = \frac{m_z(t-r)\,\exp(s_{tr})}{\sum_{q=t-S}^{t-1} m_z(t-q)\,\exp(s_{tq})} \qquad (4) $$

The masking function is a non-increasing function that applies a transformation to the attention scores to keep them in the range [0, 1]. The parameters of m_z are updated together with the model parameters to learn the optimal span.
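To make Equations 1 and 4 concrete, here is a minimal PyTorch sketch of a learnable soft span mask applied to attention weights. It is an illustrative re-implementation rather than the code released with this paper; the tensor shapes, the scaling of z by the maximum span, and the module name are assumptions.

```python
# Illustrative sketch of the soft masking function (Eq. 1) and re-normalization (Eq. 4).
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    def __init__(self, num_heads: int, max_span: int = 1024, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.R = ramp  # hyperparameter R controlling the softness of the mask
        # z is a learned per-head span parameter, updated with the model parameters
        self.z = nn.Parameter(torch.empty(num_heads, 1))
        nn.init.kaiming_normal_(self.z)  # the paper initializes z with a Kaiming normal distribution

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, num_heads, query_len, span) raw similarities s_tr
        span = scores.size(-1)
        # distance (t - r) of each past position from the current token
        x = torch.arange(span - 1, -1, -1, device=scores.device, dtype=scores.dtype)
        # Equation 1: m_z(x) = min(max((R + z - x) / R, 0), 1); z scaled to token units (assumption)
        mask = ((self.R + self.z * self.max_span - x) / self.R).clamp(0, 1)
        mask = mask.view(1, -1, 1, span)
        # Equation 4: weight exp(s_tr) by m_z(t - r) and re-normalize over the span
        weights = mask * torch.softmax(scores, dim=-1)
        return weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```

Multiplying the softmax weights by the mask and re-normalizing is equivalent to Equation 4, since the softmax denominator cancels out.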

2.3 Adaptive Sparse Attention

In order to make attention weights sparse, we use α-entmax as proposed in Correia et al. (2019). Specifically, the softmax in Equation 3 is replaced with α-entmax to compute attention weights from the attention scores:

$$ \mathrm{Att}(Q, K, V) = \pi\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V \qquad (5) $$

$$ \pi(Z)_{ij} = \alpha\text{-entmax}(z_i)_j \qquad (6) $$

α plays a crucial role in determining the behavior of an attention head. If α > 1, the weight distribution moves away from softmax's dense representation towards sparse mappings as its curvature changes. For α = 2, we obtain completely sparse mappings (sparsemax). The value of α oscillates between 1 and 2. It is set as a network parameter that is jointly optimized during training, and different values of α govern the behavior of each attention head.
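To build intuition for how α governs sparsity, the following is a minimal, forward-only sketch that computes α-entmax by bisection over the threshold τ; with α = 2 it reduces to sparsemax and can assign exact zeros, which softmax cannot. In practice one would use a released entmax implementation that also provides the correct backward pass; the function name, bounds, and iteration count here are assumptions.

```python
# Forward-only alpha-entmax via bisection, for intuition only (not the authors' code).
import torch

def alpha_entmax(scores: torch.Tensor, alpha: float = 1.5, n_iter: int = 50) -> torch.Tensor:
    """p_i = [(alpha - 1) * z_i - tau]_+ ** (1 / (alpha - 1)), with tau found by
    bisection so that the probabilities sum to 1 along the last dimension."""
    assert alpha > 1.0, "alpha must be > 1 (alpha -> 1 recovers softmax in the limit)"
    z = (alpha - 1) * scores
    # tau lies in [max(z) - 1, max(z)]: at the lower bound the mass is >= 1, at the upper it is <= 1
    tau_lo = z.max(dim=-1, keepdim=True).values - 1.0
    tau_hi = z.max(dim=-1, keepdim=True).values
    for _ in range(n_iter):
        tau = (tau_lo + tau_hi) / 2
        p = torch.clamp(z - tau, min=0) ** (1 / (alpha - 1))
        mass = p.sum(dim=-1, keepdim=True)
        # mass decreases as tau grows: move the bracket accordingly
        tau_lo = torch.where(mass >= 1, tau, tau_lo)
        tau_hi = torch.where(mass >= 1, tau_hi, tau)
    p = torch.clamp(z - tau_lo, min=0) ** (1 / (alpha - 1))
    return p / p.sum(dim=-1, keepdim=True)

# Example: alpha = 2 (sparsemax) assigns exact zeros to low-scoring positions.
weights = alpha_entmax(torch.tensor([[2.0, 1.0, -1.0, -3.0]]), alpha=2.0)
```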

    2.4 LayerDrop

LayerDrop (Fan et al., 2019) is a method to reduce the depth of the transformer in a controlled manner. This method drops sub-layers of the transformer as determined by a pruning strategy. We follow the Every Other strategy, which drops layers as specified by a drop rate; it has been noted that this pruning strategy works well compared to the Search on Valid and Data Driven pruning strategies. Let N denote the total number of layers in the network. Setting p = 1 implies that we drop one layer out of all the layers assigned to a modality, so the number of remaining layers becomes N − p. Although the network still contains the same number of parameters as the N-layer model, the operations carried out are equivalent to those of N − p layers. This strategy allows us to prune layers at inference time.
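The sketch below illustrates the mechanism described above: whole layers are skipped stochastically during training, and p evenly spaced layers are pruned at inference so the forward pass uses N − p layers. The function name and the exact drop schedule are assumptions for illustration, not the authors' implementation.

```python
# Illustrative LayerDrop with an "Every Other"-style pruning schedule at inference.
import torch
import torch.nn as nn

def forward_with_layerdrop(layers: nn.ModuleList, hidden: torch.Tensor,
                           p: int, training: bool) -> torch.Tensor:
    n = len(layers)
    if training:
        # structured dropout: skip each layer independently with probability p / n
        keep = [torch.rand(1).item() >= p / n for _ in range(n)]
    else:
        # deterministic pruning: drop p evenly spaced layers (one layer when p = 1)
        stride = n // p if p > 0 else n + 1
        keep = [(i + 1) % stride != 0 for i in range(n)]
    for layer, k in zip(layers, keep):
        if k:
            hidden = layer(hidden)
    return hidden
```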

    3 Experimental Setup

Visual Question Answering  To solve the VQA task, given an image and a question related to it, the network must predict the right answer from a given set of answer choices. We performed all experiments on the VQA 2.0 dataset (Antol et al., 2015). The dataset consists of three sets: a train set containing 83k images and 444k questions, a validation set containing 41k images and 214k questions, and a test set containing 81k images and 448k questions. The network is asked to predict an answer from 3,129 answer choices for each question.

  • 4

Implementation  We use the pre-trained weights provided by Tan and Bansal (2019). We fine-tune LXMERT with the adaptive approaches mentioned above to form visiolinguistic representations from image and text sequences. This is followed by a classifier that receives the concatenated pooled image and text features to predict the answer. Fine-tuning is performed on a single P100 GPU with a batch size of 128. Optimization is performed with Lookahead (Zhang et al., 2019) using LAMB (You et al., 2019) as the inner optimizer. The learning rate schedule is regulated by cyclical learning rates (Smith, 2017), with the base and maximum learning rates set to 1e−5 and 1e−4.
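As a rough illustration of this optimization setup (Lookahead wrapping LAMB with a cyclical learning-rate schedule), a sketch is shown below. It assumes a third-party `torch_optimizer` package exposing `Lamb` and `Lookahead`; the exact imports, signatures, and the Lookahead hyperparameters (k, alpha) are assumptions rather than the authors' exact configuration.

```python
# Sketch of the optimizer/scheduler combination described above (assumed APIs).
import torch
import torch_optimizer as topt  # assumed to provide Lamb and Lookahead

def build_optimizer(model: torch.nn.Module):
    inner = topt.Lamb(model.parameters(), lr=1e-4)       # LAMB as the inner optimizer
    optimizer = topt.Lookahead(inner, k=5, alpha=0.5)    # Lookahead wrapper (k, alpha assumed)
    scheduler = torch.optim.lr_scheduler.CyclicLR(
        optimizer, base_lr=1e-5, max_lr=1e-4,
        cycle_momentum=False,  # LAMB exposes no classical momentum term to cycle
    )
    return optimizer, scheduler
```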

    4 Experimental Findings and Results

Adaptive span for understanding the complexity of the input sequence  We demonstrate how learning spans can help in understanding the behavior of individual layers. Figure 1 shows how the span varies among different attention layers. Studying spans can help us understand which layers are more sensitive to the input sequences encountered during the training process.

In the case of the single-modality encoders, the spans of the self-attention layers for vision and language decrease monotonically, indicating that the learning behavior is somewhat similar, although the slopes tell us that the rate of learning is dissimilar. Similar behavior is seen in the cross-modality encoder for language.

Requiring a larger context size is indicative of the complexity of the sequences. When self-attention attends to both modalities, we observe that the intermediate layers responsible for forming complex representations increase their spans. This observation shows that a larger span is necessary to attend to both modalities jointly. Self-attention also requires a large span when attending to visual features in the cross-modality encoder. This suggests that visual sequences are perceived as more complex inputs to process than language inputs in the cross-modality encoder.

Determining sparsity preferences for the vision and language modalities with α  The value of α determines whether a head favors a sparse or a dense attention weight distribution. For the language modality, self-attention mostly favors a sparse mapping of attention weights in the intermediate layers. Similar behavior is observed inside the cross-modality encoder as well. This observation shows that the language modality benefits from sparse weights being assigned as the attention distribution. The value of α stays below 1.5 when processing visual inputs. When the vision modality is involved, heads that initially preferred a sparse mapping converge towards a denser mapping, indicating that this representation of attention weights is preferred. We also observe that when both modalities are involved, the network prefers an even denser weight distribution. These observations show that the vision modality is given more preference (partly due to its perceived complexity) over language inputs when processing the sequence. Figure 3 shows the variation of α values as training progresses.

Figure 2: Regularization effect of LayerDrop (accuracy vs. epoch for the layerdrop-10-6-6, layerdrop-9-5-5, attn_span, and sparse configurations).

Regularization effect of LayerDrop  We consider two configurations of the model. The first has 10 language, 6 vision, and 6 cross-modality layers with the drop rate (p) set to 1 layer. In this case, the number of parameters is larger, but the FLOPs are equivalent to the standard 9-5-5 baseline configuration. The second has the 9-5-5 configuration with p set to 1; this rate causes a FLOP reduction of 17.54%. It is observed that LayerDrop requires ∼3.5x more compute runtime to converge during training. A possible explanation is that the additional training aids in forming a consolidated understanding of multi-modal representations. Even after ensuring the convergence of the model, the strong regularization effect (even with the minimum value of p) prevents the network from coming close to the performance of the other adaptive methods with an equivalent number of parameters used in training. Figure 2 and Table 2 show these observations.

Quantitative Analysis  Table 1 compares the adaptive approaches with the baseline model and other state-of-the-art models, which rely on the standard softmax attention mechanism. We notice that these approaches achieve performance close to that of standard attention mechanisms while being computationally efficient. The results are reported without any hyperparameter tuning.

Figure 3: Variation of α in entmax in the first six attention heads during an intermediate training stage of the 9-5-5 LXMERT model. The x and y axes denote the epoch and α values, respectively; for simplicity, only the first six of the 12 attention heads are shown, and color codes denote different attention heads. Panels: Language Encoder (9 layers), Cross-Modality Encoder (Language) (5 layers), Cross-Modality Encoder for Vision and Language (5 layers), Cross-Modality Encoder for Vision (5 layers), Vision Encoder (5 layers).

Figure 4: Top-5 confidence scores for an example input sequence. Left: Adaptive Entmax. Center: Adaptive Attention Span. Right: 10-6-6 configuration with LayerDrop (p = 1).

Qualitative Analysis  In this section, we analyze the confidence scores on complex examples to better understand the network's predictions. We usually take the class with the maximum confidence, but analyzing the confidence scores of other classes can help us learn what the network is learning about the similarity of different classes present in the image. Figure 4 shows confidence scores on an example input. We observe that entmax aids in forming a consolidated understanding of contrastive features. In most cases, the top-5 confidence scores include predictions present in the ground truth. Due to sparse mapping, the network makes strong, confident predictions about one label. When trained with an adaptive attention span, the network sometimes seems unsure about the correct label, as expected from softmax behavior; it works well when a high probability is assigned to one label in the ground truth. We did not observe comparable performance from LayerDrop. In this example, the right answer is assigned a very low score, and the network does not seem to properly learn features that distinguish similar classes.

    5 Ablation Analysis

To use both the adaptive attention span and sparse attention weight mapping together, we normalize attention scores with entmax instead of softmax before applying the masking function. It is evident from Table 2 that the adaptive span works better with the denser representation of attention weights. The effect of the soft masking function is reduced when used with a sparse mapping function. We evaluate the LayerDrop method with two configurations of the network, 9-5-5 (language, vision, and cross-modality layers) and 10-6-6, both with p = 1.


Model                                  test-dev   test-std
BUTD (Anderson et al., 2018)           65.32      65.67
ViLBERT (Lu et al., 2019)              70.55      70.92
VL-BERT (Su et al., 2019)              71.16      -
VisualBERT (Li et al., 2019)           70.80      71.00
UNITER (Chen et al., 2019)             72.27      72.46
LXMERT (Tan and Bansal, 2019)
  w/ softmax                           72.42      72.54
  w/ Adaptive Attention Span           71.62      71.72
  w/ Adaptive Sparse                   71.73      71.97
  w/ LayerDrop (10-6-6) (p=1)          66.4       66.72

Table 1: Comparison of the adaptive approaches to state-of-the-art methods on the VQA dataset.

Model                                  test-dev   test-std
LXMERT (Tan and Bansal, 2019)
  w/ Attention Span and Entmax         63.07      63.33
  Default (10-6-6)                     66.35      66.57
  w/ LayerDrop (9-5-5) (p=1)           66.51      66.81

Table 2: Ablation study for the adaptive approaches.

From Table 2, we see that the shallower network performs better than the deeper-layered model. This observation shows that there is a specific threshold drop rate up to which LayerDrop helps. It is plausible that this type of regularization is favorable in deeper networks.

    6 Conclusion

While attention-based approaches are becoming universal, computationally efficient methods must be favored for broader adoption of the provided pre-trained models on low-resource hardware. Adaptive methods can significantly reduce the cost incurred to train such models, along with their carbon footprint. In this work, we extend adaptive approaches to visiolinguistic tasks to understand more about attention and adaptive mechanisms. While the empirical results are encouraging, important future work includes the exploration of more efficient adaptive and sparse mechanisms that can significantly reduce FLOPs and parameter counts with minimal loss in performance.

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015.

Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.

Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.

Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E. Hinton. 2019. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pages 9593–9604.