


SA-NET: SHUFFLE ATTENTION FOR DEEP CONVOLUTIONAL NEURAL NETWORKS

Qing-Long Zhang, Yu-Bin Yang∗

State Key Laboratory for Novel Software Technology at Nanjing University

ABSTRACT

Attention mechanisms, which enable a neural network to accurately focus on all the relevant elements of the input, have become an essential component for improving the performance of deep neural networks. There are mainly two attention mechanisms widely used in computer vision studies, spatial attention and channel attention, which aim to capture the pixel-level pairwise relationship and the channel dependency, respectively. Although fusing them together may achieve better performance than their individual implementations, it inevitably increases the computational overhead. In this paper, we propose an efficient Shuffle Attention (SA) module to address this issue, which adopts Shuffle Units to combine the two types of attention mechanisms effectively. Specifically, SA first groups the channel dimension into multiple sub-features before processing them in parallel. Then, for each sub-feature, SA utilizes a Shuffle Unit to depict feature dependencies in both the spatial and channel dimensions. After that, all sub-features are aggregated and a "channel shuffle" operator is adopted to enable information communication between different sub-features. The proposed SA module is efficient yet effective; e.g., the parameters and computations of SA against the ResNet50 backbone are 300 vs. 25.56M and 2.76e-3 GFLOPs vs. 4.12 GFLOPs, respectively, and the performance boost is more than 1.34% in terms of Top-1 accuracy. Extensive experimental results on commonly used benchmarks, including ImageNet-1k for classification and MS COCO for object detection and instance segmentation, demonstrate that the proposed SA significantly outperforms the current SOTA methods by achieving higher accuracy while having lower model complexity. The code and models are available at https://github.com/wofmanaf/SA-Net.

Index Terms— spatial attention, channel attention, channel shuffle, grouped features

1. INTRODUCTION

Attention mechanisms have been attracting increasing attention in research communities, since they help to improve the representation of interest, i.e., focusing on essential features while suppressing unnecessary ones [1, 2, 3, 4]. Recent studies show that correctly incorporating attention mechanisms

∗Funded by the Natural Science Foundation of China (No. 61673204).

Fig. 1. Comparisons of recent SOTA attention models on ImageNet-1k, including SENet, CBAM, ECA-Net, SGE-Net, and SA-Net, using ResNets as backbones, in terms of accuracy, network parameters, and GFLOPs. The size of the circles indicates GFLOPs. Clearly, the proposed SA-Net achieves higher accuracy while having lower model complexity.

into convolution blocks can significantly improve the performance of a broad range of computer vision tasks, e.g., image classification, object detection, and instance segmentation.

There are mainly two types of attention mechanisms most commonly used in computer vision: channel attention and spatial attention, both of which strengthen the original features by aggregating the same feature from all the positions with different aggregation strategies, transformations, and strengthening functions [5, 6, 7, 8, 9]. Based on these observations, some studies, including GCNet [1] and CBAM [10], integrated both spatial attention and channel attention into one module and achieved significant improvement [10, 4]. However, they generally suffered from either convergence difficulty or heavy computation burdens. Other studies managed to simplify the structure of channel or spatial attention [1, 11]. For example, ECA-Net [11] simplifies the computation of channel weights in the SE block by using a 1-D convolution. SGE [12] groups the channel dimension into multiple sub-features to represent different semantics, and applies a spatial mechanism to each feature group by scaling the feature vectors over all locations with an attention mask.



However, these methods did not take full advantage of the correlation between spatial and channel attention, making them less efficient. "Can one fuse different attention modules in a lighter but more efficient way?"

To answer this question, we first revisit the unit of ShuffleNet v2 [13], which can efficiently construct a multi-branch structure and process different branches in parallel. Specifically, at the beginning of each unit, the input of c feature channels is split into two branches with c − c′ and c′ channels. Afterwards, several convolution layers are adopted to capture a higher-level representation of the input. After these convolutions, the two branches are concatenated so that the number of channels remains the same as the number of input channels. At last, the "channel shuffle" operator (defined in [14], sketched below) is adopted to enable information communication between the two branches. In addition, to increase calculation speed, SGE [12] introduces a grouping strategy, which divides the input feature map into groups along the channel dimension; all sub-features can then be enhanced in parallel.
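To make the operator concrete, here is a minimal sketch of "channel shuffle", assuming the usual reshape-transpose-reshape trick from [14]; `channel_shuffle` is an illustrative helper name, not the official implementation.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so information can flow between branches."""
    b, c, h, w = x.shape
    x = x.reshape(b, groups, c // groups, h, w)  # split channels into `groups`
    x = x.transpose(1, 2).contiguous()           # swap the group and per-group axes
    return x.reshape(b, c, h, w)                 # flatten back to (b, c, h, w)

# With 8 channels and 2 groups, channel order [0..3 | 4..7] becomes [0, 4, 1, 5, 2, 6, 3, 7].
y = channel_shuffle(torch.arange(8.0).reshape(1, 8, 1, 1), groups=2)
```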

Based on the above observations, this paper proposes a lighter but more efficient Shuffle Attention (SA) module for deep Convolutional Neural Networks (CNNs), which groups the channel dimension into sub-features. For each sub-feature, SA adopts the Shuffle Unit to construct channel attention and spatial attention simultaneously. For each attention module, this paper designs an attention mask over all the positions to suppress possible noise and highlight the correct semantic feature regions. Experimental results on ImageNet-1k (shown in Figure 1) demonstrate that the proposed simple but effective module, which contains fewer parameters, achieves higher accuracy than the current state-of-the-art methods.

The key contributions of this paper are summarized as follows: 1) We introduce a lightweight yet effective attention module, SA, for deep CNNs, which groups the channel dimension into multiple sub-features and then utilizes a Shuffle Unit to integrate complementary channel and spatial attention modules for each sub-feature. 2) Extensive experimental results on ImageNet-1k and MS COCO demonstrate that the proposed SA has lower model complexity than state-of-the-art attention approaches while achieving outstanding performance.

2. RELATED WORK

Multi-branch architectures. Multi-branch architectures of CNNs have evolved for years and are becoming more accurate and faster. The principle behind multi-branch architectures is "split-transform-merge", which eases the difficulty of training networks with hundreds of layers. The InceptionNet series [15, 16] are successful multi-branch architectures in which each branch is carefully configured with customized kernel filters, in order to aggregate more informative and multifarious features. ResNets [17] can also be viewed as two-branch networks, in which one branch is the identity mapping. SKNets [2] and the ShuffleNet family [13] both followed the idea of InceptionNets with various filters for multiple branches, while differing in at least two important aspects. SKNets utilized an adaptive selection mechanism to realize adaptive receptive field sizes of neurons. ShuffleNets further merged the "channel split" and "channel shuffle" operators into a single element-wise operation to make a trade-off between speed and accuracy.

Grouped Features. Learning features in groups dates back to AlexNet [18], whose motivation was to distribute the model over more GPU resources. Deep Roots examined AlexNet and pointed out that convolution groups can learn better feature representations. The MobileNets [19, 20] and ShuffleNets [13] treated each channel as a group and modeled the spatial relationships within these groups. CapsuleNets [21, 22] modeled each grouped neuron as a capsule, in which the neuron activity in the active capsule represents various attributes of a particular entity in the image. SGE [12] extended CapsuleNets and divided the channel dimension into multiple sub-features to learn different semantics.

Attention mechanisms. The significance of attention has been studied extensively in the previous literature. It biases the allocation of the most informative feature expressions while suppressing the less useful ones. The self-attention method calculates the context at one position as a weighted sum of all the positions in an image. SE [23] modeled channel-wise relationships using two FC layers. ECA-Net [11] adopted a 1-D convolution filter to generate channel weights and significantly reduced the model complexity of SE. Wang et al. [24] proposed the non-local (NL) module to generate a large attention map by calculating the correlation matrix between each pair of spatial points in the feature map. CBAM [10], GCNet [1], and SGE [12] combined spatial attention and channel attention serially, while DANet [3] adaptively integrated local features with their global dependencies by summing the two attention modules from different branches.

3. SHUFFLE ATTENTION

In this section, we first introduce the process of constructing the SA module, which divides the input feature map into groups and uses a Shuffle Unit to integrate channel attention and spatial attention into one block for each group. After that, all sub-features are aggregated and a "channel shuffle" operator is utilized to enable information communication between different sub-features. Then, we show how to adopt SA in deep CNNs. Finally, we visualize the effect and validate the reliability of the proposed SA. The overall architecture of the SA module is illustrated in Figure 2.

Feature Grouping. For a given feature map X ∈ R^{C×H×W}, where C, H, W indicate the channel number, spatial height, and width, respectively, SA first divides X into G groups along the channel dimension, i.e.,


Fig. 2. An overview of the proposed SA module. It adopts "channel split" to process the sub-features of each group in parallel. The channel attention branch uses GAP to generate channel-wise statistics and then uses a pair of parameters to scale and shift the channel vector. The spatial attention branch adopts group norm to generate spatial-wise statistics and then creates a compact feature similarly to the channel branch. The two branches are then concatenated. After that, all sub-features are aggregated, and finally a "channel shuffle" operator is utilized to enable information communication between different sub-features.

X = [X_1, · · · , X_G], X_k ∈ R^{C/G×H×W}, in which each sub-feature X_k gradually captures a specific semantic response during training. Then, we generate the corresponding importance coefficient for each sub-feature through an attention module. Specifically, at the beginning of each attention unit, the input X_k is split into two branches along the channel dimension, i.e., X_{k1}, X_{k2} ∈ R^{C/2G×H×W}. As illustrated in Figure 2, one branch is adopted to produce a channel attention map by exploiting the inter-channel relationship, while the other branch is used to generate a spatial attention map by utilizing the inter-spatial relationship of features, so that the model can focus on "what" and "where" is meaningful.
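A minimal sketch of the feature grouping and channel split described above, under the assumption (common in such implementations) that the G groups are folded into the batch dimension so they are processed in parallel; all sizes are hypothetical.

```python
import torch

b, C, H, W, G = 2, 64, 14, 14, 8          # hypothetical batch size, channels, spatial size, groups
x = torch.randn(b, C, H, W)

# Feature Grouping: X -> [X_1, ..., X_G], each X_k with C/G channels,
# with the groups folded into the batch dimension.
x = x.reshape(b * G, C // G, H, W)

# Channel split: each X_k is divided into the channel- and spatial-attention branches.
x_k1, x_k2 = x.chunk(2, dim=1)            # X_k1, X_k2 have C/(2G) channels each
```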

Channel Attention. One option to fully capture channel-wise dependencies is the SE block proposed in [23]. However, it would introduce too many parameters, which is undesirable when designing a lightweight attention module that trades off speed against accuracy. It is also not suitable to generate channel weights with a fast 1-D convolution of size k as in ECA [11], because k tends to be large. Instead, we provide an alternative that first embeds the global information by simply using global average pooling (GAP) to generate the channel-wise statistics s ∈ R^{C/2G×1×1}, which can be calculated by shrinking X_{k1} through the spatial dimensions H × W:

s = F_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j)    (1)

Furthermore, a compact feature is created to enable guidance for precise and adaptive selection. This is achieved by a simple gating mechanism with a sigmoid activation. Then, the final output of channel attention can be obtained by

X'_{k1} = \sigma(F_c(s)) \cdot X_{k1} = \sigma(W_1 s + b_1) \cdot X_{k1}    (2)

where W_1 ∈ R^{C/2G×1×1} and b_1 ∈ R^{C/2G×1×1} are parameters used to scale and shift s.

Fig. 3. PyTorch code of the proposed SA module

Spatial Attention. Different from channel attention, spatial attention focuses on "where" the informative parts are, which is complementary to channel attention. First, we use Group Norm (GN) [25] over X_{k2} to obtain spatial-wise statistics. Then, F_c(·) is adopted to enhance the representation of X_{k2}. The final output of spatial attention is obtained by

X'_{k2} = \sigma(W_2 \cdot GN(X_{k2}) + b_2) \cdot X_{k2}    (3)


Fig. 4. Validation on the effectiveness of SA.

where W_2 and b_2 are parameters with shape R^{C/2G×1×1}.

The two branches are then concatenated to make the number of channels the same as the number of input channels, i.e., X'_k = [X'_{k1}, X'_{k2}] ∈ R^{C/G×H×W}.
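Correspondingly, a minimal sketch of Eq. (3) and the concatenation step, assuming GN is a standard nn.GroupNorm over this branch's channels; names are again illustrative.

```python
import torch
import torch.nn as nn

c = 4                                           # C/(2G) channels (hypothetical)
x_k2 = torch.randn(16, c, 14, 14)

gn = nn.GroupNorm(c, c)                         # GN producing spatial-wise statistics
w2 = nn.Parameter(torch.zeros(1, c, 1, 1))      # W_2
b2 = nn.Parameter(torch.ones(1, c, 1, 1))       # b_2

x_k2_out = torch.sigmoid(w2 * gn(x_k2) + b2) * x_k2   # Eq. (3)

# The two branches are then concatenated back to C/G channels per group:
# x_k_out = torch.cat([x_k1_out, x_k2_out], dim=1)
```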

Aggregation. After that, all the sub-features are aggregated. Finally, similar to ShuffleNet v2 [13], we adopt a "channel shuffle" operator to enable cross-group information flow along the channel dimension. The final output of the SA module has the same size as X, making SA easy to integrate with modern architectures.

Note that W_1, b_1, W_2, b_2, and the Group Norm hyper-parameters are the only parameters introduced by the proposed SA. In a single SA module, the number of channels in each branch is C/2G. Therefore, the total number of parameters is 3C/G (typically G is 32 or 64), which is trivial compared with the millions of parameters of the entire network, making SA quite lightweight. The overall architecture of the SA module is illustrated in Figure 2.

Implementation. SA can be easily implemented in a few lines of code in PyTorch and TensorFlow, where automatic differentiation is supported. Figure 3 shows the code based on PyTorch.
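Since the listing of Figure 3 is not reproduced here, the following is a minimal re-implementation sketch assembled from Eqs. (1)-(3) and the aggregation step. It assumes the branch parameters are shared across groups (groups folded into the batch) and that the final shuffle uses two groups; the released code in the repository above may differ in details.

```python
import torch
import torch.nn as nn


class ShuffleAttention(nn.Module):
    """Minimal sketch of the SA module (illustrative, not the official implementation)."""

    def __init__(self, channels: int, groups: int = 64):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)                            # channels per branch
        self.cweight = nn.Parameter(torch.zeros(1, c, 1, 1))    # W_1
        self.cbias = nn.Parameter(torch.ones(1, c, 1, 1))       # b_1
        self.sweight = nn.Parameter(torch.zeros(1, c, 1, 1))    # W_2
        self.sbias = nn.Parameter(torch.ones(1, c, 1, 1))       # b_2
        self.gn = nn.GroupNorm(c, c)

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.reshape(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
        return x.reshape(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.reshape(b * self.groups, -1, h, w)                # feature grouping
        x_c, x_s = x.chunk(2, dim=1)                            # channel / spatial branches

        s = x_c.mean(dim=(2, 3), keepdim=True)                  # Eq. (1): GAP
        x_c = torch.sigmoid(self.cweight * s + self.cbias) * x_c             # Eq. (2)
        x_s = torch.sigmoid(self.sweight * self.gn(x_s) + self.sbias) * x_s  # Eq. (3)

        out = torch.cat([x_c, x_s], dim=1)                      # concatenate the two branches
        out = out.reshape(b, c, h, w)                           # aggregate all sub-features
        return self.channel_shuffle(out, 2)                     # cross-group information flow


# Hypothetical usage: append SA to the residual branch of a ResNet bottleneck.
sa = ShuffleAttention(channels=256, groups=64)
y = sa(torch.randn(2, 256, 56, 56))
```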

SA-Net for Deep CNNs. To adopt SA in deep CNNs, we use exactly the same configuration as SENet [23] and simply replace the SE block with the SA module. The resulting networks are named SA-Net.

3.1. Visualization and Interpretation

In order to verify whether SA can improve semantic feature representation by utilizing feature grouping and channel shuffle, we first train SA-Net50B (without "channel shuffle") and SA-Net50 (with "channel shuffle") on the ImageNet-1k training set. Assume I is the original input; we calculate the Top-1 accuracy of I × X_k in each group at SA_5_3 (i.e., the last bottleneck, following the naming scheme SA_stageID_blockID) before and after applying the SA module. We use the accuracy scores as indicators and plot their distribution for different classes ("pug", "goldfish", and "plane") in the ImageNet-1k validation set in Figure 4 (a, b). For comparison, we also plot the distribution of classification scores across all 1000 classes.

Fig. 5. Sample visualizations on the ImageNet-1k val split generated by Grad-CAM. The target layer selected is "layer4.2" in all cases.

As shown in Figure 4 (a, b), the Top-1 accuracy statistically increases after SA, which means feature grouping can significantly enhance the semantic representation of feature maps. In addition, the average score in each group gains ≈ 0.4% with "channel shuffle", which demonstrates the effectiveness of "channel shuffle".

To fully validate the effectiveness of SA, we plot the distribution of average activations (the mean value of the channel-wise feature maps in each group, similar to SE) across three classes ("pug", "goldfish", and "plane") at different depths in SA-Net50 (with shuffle). The results are shown in Figure 4(c). We make several observations about the role of the SA module: (1) the distributions across different classes are very similar to each other at the earlier layers (e.g., SA_2_3 and SA_3_4), which suggests that the importance of feature groups is likely to be shared by different classes in the early stages; (2) at greater depths, the activation of each group becomes much more class-specific, as different classes respond differently to the discriminative value of features (e.g., SA_4_6 and SA_5_3); (3) SA_5_2 exhibits a similar pattern over different classes, which means SA_5_2 is less important than other blocks in providing recalibration to the network.

In order to validate the effectiveness of SA more intuitively, we sample 9 images from the ImageNet-1k val split. We use Grad-CAM [26] to visualize their heatmaps at SA_5_3 in SA-Net50. For comparison, we also draw the heatmaps of ResNet50 at "layer4.2". As shown in Figure 5, our proposed SA module allows the classification model to focus on more relevant regions with more object details, which means the SA module can effectively improve classification accuracy.
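For reference, a minimal, generic Grad-CAM sketch in PyTorch (not the exact script used for Figure 5); a stock ResNet-50 stands in for SA-Net50 here, and the hook-based bookkeeping is illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(pretrained=True).eval()     # stand-in backbone
feats, grads = {}, {}

target_layer = model.layer4[2]                      # corresponds to "layer4.2" in the paper
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.randn(1, 3, 224, 224)                   # a preprocessed input image
model(img)[0].max().backward()                      # back-propagate the top class score

weights = grads["a"].mean(dim=(2, 3), keepdim=True)            # per-channel importance
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # heatmap in [0, 1]
```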

Therefore, the proposed SA module is validated to indeed enhance the representation power of networks.

4. EXPERIMENTS

Experiment Setup. All experiments are conducted with exactly the same data augmentation and hyper-parameter settings as in [17] and [23]. Specifically, the input images are randomly cropped to 224 × 224 with random horizontal flipping. The number of groups G in SA-Net and SGE-Net is set to 64 for both. The parameters in F_c(·) are initialized to 0 for the weights (W_1 and W_2) and 1 for the biases (b_1 and b_2) to obtain better results. All architectures are trained from scratch by SGD with weight decay 1e-4, momentum 0.9, and mini-batch size 256 (using 8 GPUs with 32 images per GPU) for 100 epochs, starting from an initial learning rate of 0.1 (with a linear warm-up [27] of 5 epochs) and decreasing it by a factor of 10 every 30 epochs. For testing on the validation set, the shorter side of an input image is first resized to 256, and a center crop of 224 × 224 is used for evaluation.
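A minimal sketch of the data augmentation, optimizer, and learning-rate schedule described above (illustrative only; the authors' actual training script is not part of the paper).

```python
import torch
from torchvision import transforms

# Augmentation: random 224x224 crop + horizontal flip for training; resize-256 / center-crop-224 for eval.
train_tf = transforms.Compose([transforms.RandomResizedCrop(224),
                               transforms.RandomHorizontalFlip(),
                               transforms.ToTensor()])
val_tf = transforms.Compose([transforms.Resize(256),
                             transforms.CenterCrop(224),
                             transforms.ToTensor()])

model = torch.nn.Linear(10, 10)                    # stand-in for SA-Net50
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

def lr_factor(epoch: int) -> float:
    """Linear warm-up for 5 epochs, then divide the learning rate by 10 every 30 epochs."""
    if epoch < 5:
        return (epoch + 1) / 5
    return 0.1 ** (epoch // 30)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# for epoch in range(100): train one epoch, then call scheduler.step()
```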

Classification on ImageNet-1k. We compare SA-Net with the current SOTA attention methods. Evaluation metrics include both efficiency (i.e., network parameters and GFLOPs) and effectiveness (i.e., Top-1/Top-5 accuracy). As shown in Table 1, SA-Net shares almost the same model complexity (i.e., network parameters and FLOPs) with the original ResNet [17], but achieves a 1.34% gain in Top-1 accuracy and a 0.89% gain in Top-5 accuracy (on ResNet-50). When using ResNet-101 as the backbone, the gains are 0.76% and 0.59%, respectively. Compared with SOTA counterparts, SA obtains higher accuracy while having lower or similar model complexity.


Table 1. Comparisons of different attention methods on ImageNet-1k in terms of network parameters (Param.), giga floating-point operations (GFLOPs), and Top-1/Top-5 accuracy (in %). The best records and the improvements are marked in bold and ↑, respectively.

Attention Methods   Backbones    Param.    GFLOPs   Top-1 Acc (%)      Top-5 Acc (%)
ResNet [17]         ResNet-50    25.557M   4.122    76.384             92.908
SENet [23]          ResNet-50    28.088M   4.130    77.462             93.696
CBAM [10]           ResNet-50    28.090M   4.139    77.626             93.660
SGE-Net [12]        ResNet-50    25.559M   4.127    77.584             93.664
ECA-Net [11]        ResNet-50    25.557M   4.127    77.480             93.680
SA-Net (Ours)       ResNet-50    25.557M   4.125    77.724 (↑ 1.34)    93.798 (↑ 0.89)
ResNet [17]         ResNet-101   44.549M   7.849    78.200             93.906
SENet [23]          ResNet-101   49.327M   7.863    78.468             94.102
CBAM [10]           ResNet-101   49.330M   7.879    78.354             94.064
SGE-Net [12]        ResNet-101   44.553M   7.858    78.798             94.368
ECA-Net [11]        ResNet-101   44.549M   7.858    78.650             94.340
SA-Net (Ours)       ResNet-101   44.551M   7.854    78.960 (↑ 0.76)    94.492 (↑ 0.59)

Specifically, when using ResNet-101 as the backbone, the SE [23] module adds 4.778M parameters and 14.34 MFLOPs for a 0.268% Top-1 accuracy gain, whereas our SA adds only 0.002M parameters and 5.12 MFLOPs while improving Top-1 accuracy by 0.76%, demonstrating that SA is both lighter and more efficient.

Table 2. Performance comparisons of SA-Net (using ResNet-50 as the backbone) with four variants (i.e., eliminating Group Norm, eliminating channel shuffle, eliminating F_c(·), and replacing F_c(·) with a 1 × 1 Conv) on ImageNet-1k in terms of GFLOPs and Top-1/Top-5 accuracy (in %). The best records are marked in bold.

Methods        GFLOPs   Top-1 Acc (%)   Top-5 Acc (%)
origin         4.125    77.724          93.798
w/o GN         4.125    77.372          93.804
w/o shuffle    4.125    77.598          93.758
w/o Fc(·)      4.125    77.608          93.886
1×1 Conv       4.140    77.684          93.840

Ablation Study. We report ablation studies of SA-Net50 on ImageNet-1k to thoroughly investigate the components of SA. As shown in Table 2, the performance drops significantly when Group Norm is eliminated, which indicates that the distribution of features generated by different samples from the same semantic group is inconsistent; it is difficult to learn robust importance coefficients without normalization. When channel shuffle is eliminated, the performance drops a little, which demonstrates that information communication among different groups can enhance the feature representation. Unsurprisingly, the performance also drops if F_c(·) is eliminated, since F_c(·) is adopted to enhance the representation of features. However, the performance does not improve if we replace F_c(·) with a 1 × 1 Conv. This may be because the number of channels in each sub-feature is too small, so that it is unnecessary to exchange information among different channels.

Object Detection on MS COCO. We further train current SOTA detectors on the COCO train2017 split, and evaluate bounding-box Average Precision (AP) for object detection and mask AP for instance segmentation. We implement all detectors using the MMDetection toolkit with the default settings and train them for 12 epochs (namely, the '1× schedule'). For a fair comparison, we only replace the backbones pre-trained on ImageNet-1k and transfer them to MS COCO by fine-tuning, keeping the other components of the entire detector intact. As shown in Table 3, integrating either the SE block or the proposed SA module improves object detection performance by a clear margin, for both one-stage and two-stage detectors. Meanwhile, our SA outperforms the SE block with lower model complexity. Specifically, adopting Faster R-CNN [28] as the basic detector, SA outperforms SE by 1.0% and 1.4% in terms of AP when using ResNet-50 and ResNet-101, respectively. If we use RetinaNet as the base detector, the gain is 1.5% for both backbones.

Instance Segmentation on MS COCO. Instance segmentation results using Mask R-CNN [29] on MS COCO are shown in Table 4. The SA module achieves a clear improvement over the original ResNet and performs better than other state-of-the-art attention modules (i.e., the SE block, the ECA module, and the SGE unit), with less model complexity. In particular, the SA module achieves larger gains for small objects, which are usually more difficult to detect and segment correctly. These results verify that our SA module generalizes well across various computer vision tasks.


Table 3. Object detection results of different attention methods on COCO val2017. The best records and the improvements are marked in bold and ↑, respectively.

Methods          Detectors      Param.   GFLOPs   AP50:95        AP50   AP75   APS    APM    APL
ResNet-50        Faster R-CNN   41.53M   207.07   36.4           58.4   39.1   21.5   40.0   46.6
+ SE             Faster R-CNN   44.02M   207.18   37.7           60.1   40.9   22.9   41.9   48.2
+ SA (Ours)      Faster R-CNN   41.53M   207.35   38.7 (↑ 2.3)   61.2   41.4   22.3   42.5   49.8
ResNet-101       Faster R-CNN   60.52M   283.14   38.5           60.3   41.6   22.3   43.0   49.8
+ SE             Faster R-CNN   65.24M   283.33   39.6           62.0   43.1   23.7   44.0   51.4
+ SA (Ours)      Faster R-CNN   60.53M   283.60   41.0 (↑ 2.5)   62.7   44.8   24.4   45.1   52.5
ResNet-50        Mask R-CNN     44.18M   275.58   37.3           59.0   40.2   21.9   40.9   48.1
+ SE             Mask R-CNN     46.67M   275.69   38.7           60.9   42.1   23.4   42.7   50.0
+ SA (Ours)      Mask R-CNN     44.18M   275.86   39.4 (↑ 2.1)   61.5   42.6   23.4   42.8   51.1
ResNet-101       Mask R-CNN     63.17M   351.65   39.4           60.9   43.3   23.0   43.7   51.4
+ SE             Mask R-CNN     67.89M   351.84   40.7           62.5   44.3   23.9   45.2   52.8
+ SA (Ours)      Mask R-CNN     63.17M   352.10   41.6 (↑ 2.2)   63.0   45.5   24.9   45.5   54.2
ResNet-50        RetinaNet      37.74M   239.32   35.6           55.5   38.3   20.0   39.6   46.8
+ SE             RetinaNet      40.25M   239.43   36.0           56.7   38.3   20.5   39.7   47.7
+ SA (Ours)      RetinaNet      37.74M   239.60   37.5 (↑ 1.9)   58.5   39.7   21.3   41.2   45.9
ResNet-101       RetinaNet      56.74M   315.39   37.7           57.5   40.4   21.1   42.2   49.5
+ SE             RetinaNet      61.49M   315.58   38.8           59.3   41.7   22.1   43.2   51.5
+ SA (Ours)      RetinaNet      56.64M   315.85   40.3 (↑ 2.6)   61.2   43.2   23.2   44.4   53.5

Table 4. Instance segmentation results of various state-of-the-art attention modules using Mask R-CNN on COCO val2017.

Methods        AP50:95        AP50   AP75   APS    APM    APL
ResNet-50      34.2           55.9   36.2   18.2   37.5   46.3
+ SE           35.4           57.4   37.8   17.1   38.6   51.8
+ ECA          35.6           58.1   37.7   17.6   39.0   51.8
+ SGE          34.9           56.9   37.0   19.1   38.4   47.3
+ SA (Ours)    36.1 (↑ 1.9)   58.7   38.2   19.4   39.4   49.0
ResNet-101     35.9           57.7   38.4   19.2   39.7   49.7
+ SE           36.8           59.3   39.2   17.2   40.3   53.6
+ ECA          37.4           59.9   39.8   18.1   41.1   54.1
+ SGE          36.9           59.3   39.4   20.0   40.8   50.1
+ SA (Ours)    38.0 (↑ 2.1)   60.0   40.3   20.8   41.2   51.7

5. CONCLUSION

In this paper, we propose SA, a novel efficient attention module to enhance the representation power of CNNs. SA first groups the channel dimension into multiple sub-features before processing them in parallel. Then, for each sub-feature, SA utilizes a Shuffle Unit to capture feature dependencies in both the spatial and channel dimensions. Afterwards, all sub-features are aggregated, and finally we adopt a "channel shuffle" operator to enable information communication between different sub-features. Experimental results demonstrate that SA is an extremely lightweight plug-and-play block which can significantly improve the performance of various deep CNN architectures.

In the future, we will further explore the spatial and channel attention modules of SA and adopt them in more CNN architectures, including the ShuffleNet family, SKNet [2], and MobileNetV3 [20].

6. REFERENCES

[1] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," CoRR, vol. abs/1904.11492, 2019.

[2] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang, "Selective kernel networks," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019, pp. 510–519.

[3] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu, "Dual attention network for scene segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019, pp. 3146–3154.

[4] Yuhui Yuan and Jingdong Wang, "OCNet: Object context network for scene parsing," CoRR, vol. abs/1809.00916, 2018.

[5] HyunJae Lee, Hyo-Eun Kim, and Hyeonseob Nam, "SRM: A style-based recalibration module for convolutional neural networks," CoRR, vol. abs/1903.10829, 2019.

[6] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia, "PSANet: Point-wise spatial attention network for scene parsing," in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IX, 2018, pp. 270–286.

[7] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu, "Expectation-maximization attention networks for semantic segmentation," CoRR, vol. abs/1907.13426, 2019.

[8] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai, "Asymmetric non-local neural networks for semantic segmentation," CoRR, vol. abs/1908.07678, 2019.

[9] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu, "CCNet: Criss-cross attention for semantic segmentation," CoRR, vol. abs/1811.11721, 2018.

[10] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, "CBAM: Convolutional block attention module," in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, 2018, pp. 3–19.

[11] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020, pp. 11531–11539, IEEE.

[12] Xiang Li, Xiaolin Hu, and Jian Yang, "Spatial group-wise enhance: Improving semantic feature learning in convolutional networks," CoRR, vol. abs/1905.09646, 2019.

[13] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, 2018, pp. 122–138.

[14] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 6848–6856, IEEE Computer Society.

[15] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, "Rethinking the inception architecture for computer vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 2818–2826.

[16] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 4278–4284.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778.

[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, 2012, pp. 1106–1114.

[19] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 4510–4520.

[20] Andrew Howard, Ruoming Pang, Hartwig Adam, Quoc V. Le, Mark Sandler, Bo Chen, Weijun Wang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, and Yukun Zhu, "Searching for MobileNetV3," in 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 2019, pp. 1314–1324, IEEE.

[21] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 3856–3866.


[22] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst, "Matrix capsules with EM routing," in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018, OpenReview.net.

[23] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 7132–7141.

[24] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He, "Non-local neural networks," in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 7794–7803.

[25] Yuxin Wu and Kaiming He, "Group normalization," in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, 2018, pp. 3–19.

[26] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 618–626, IEEE Computer Society.

[27] Priya Goyal, Piotr Dollar, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," CoRR, vol. abs/1706.02677, 2017.

[28] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS 2015, December 7-12, 2015, Montreal, Quebec, Canada, Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, Eds., 2015, pp. 91–99.

[29] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross B. Girshick, "Mask R-CNN," in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 2980–2988.