
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3044–3053, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


Contrastive Attention Mechanism for Abstractive Sentence Summarization

Xiangyu Duan 1,2*, Hongfei Yu 2*, Mingming Yin 2, Min Zhang 1,2, Weihua Luo 3, Yue Zhang 4

1 Institute of Artificial Intelligence, Soochow University, Suzhou, China
2 School of Computer Science and Technology, Soochow University, Suzhou, China
3 Alibaba DAMO Academy, Hangzhou, China
4 School of Engineering, Westlake University, China

[email protected]; {hfyu,mmyin}@stu.suda.edu.cn; [email protected]; [email protected]; [email protected]

* Equal contribution.

Abstract

We propose a contrastive attention mechanism to extend the sequence-to-sequence framework for the abstractive sentence summarization task, which aims to generate a brief summary of a given source sentence. The proposed contrastive attention mechanism accommodates two categories of attention: one is the conventional attention that attends to relevant parts of the source sentence, the other is the opponent attention that attends to irrelevant or less relevant parts of the source sentence. Both attentions are trained in an opposite way so that the contribution from the conventional attention is encouraged and the contribution from the opponent attention is discouraged through a novel softmax and softmin functionality. Experiments on benchmark datasets show that the proposed contrastive attention mechanism is more focused on the relevant parts for the summary than the conventional attention mechanism, and greatly advances the state-of-the-art performance on the abstractive sentence summarization task. We release the code at https://github.com/travel-go/Abstractive-Text-Summarization.

1 Introduction

Abstractive sentence summarization aims at generating concise and informative summaries based on the core meaning of source sentences. Previous endeavors tackle the problem through either rule-based methods (Dorr et al., 2003) or statistical models trained on relatively small-scale training corpora (Banko et al., 2000). Following its successful applications to machine translation (Sutskever et al., 2014; Bahdanau et al., 2015), the sequence-to-sequence framework has also been applied to the abstractive sentence summarization task using large-scale sentence-summary corpora (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016), obtaining better performance than the traditional methods.

One central component in state-of-the-art sequence-to-sequence models is the use of attention for building connections between the source sequence and target words, so that a more informed decision can be made for generating a target word by considering the most relevant parts of the source sequence (Bahdanau et al., 2015; Vaswani et al., 2017). For abstractive sentence summarization, such attention mechanisms can be useful for selecting the most salient words for a short summary, while filtering the negative influence of redundant parts.

We consider improving abstractive summarization quality by enhancing the target-to-source attention. In particular, we take a contrastive approach, encouraging the contribution from the conventional attention that attends to relevant parts of the source sentence, while at the same time penalizing the contribution from an opponent attention that attends to irrelevant or less relevant parts. Contrastive attention was first proposed in computer vision (Song et al., 2018a), where it is used for person re-identification by attending to person and background regions contrastively. To our knowledge, we are the first to use contrastive attention for NLP and to deploy it in the sequence-to-sequence framework.

In particular, we take Transformer (Vaswani et al., 2017) as the baseline summarization model, and enhance it with a proponent attention module and an opponent attention module. The former acts as the conventional attention mechanism, while the latter can be regarded as a dual module to the former, with a similar weight calculation structure, but using a novel softmin function to discourage contributions from irrelevant or less relevant words.

To our knowledge, we are the first to investigate Transformer as a sequence-to-sequence summarizer. Results on three benchmark datasets show that it gives highly competitive accuracies compared with RNN and CNN alternatives. When equipped with the proposed contrastive attention mechanism, our Transformer model achieves the best reported results on all data. The visualization of attentions shows that, through using the contrastive attention mechanism, our attention is more focused on relevant parts than the baseline. We release our code at https://github.com/travel-go/Abstractive-Text-Summarization.

2 Related Work

Automatic summarization has been investigated in two main paradigms: the extractive method and the abstractive method. The former extracts important pieces of the source document and concatenates them sequentially (Jing and McKeown, 2000; Knight and Marcu, 2000; Neto et al., 2002), while the latter grasps the core meaning of the source text and re-states it in a short text as an abstractive summary (Banko et al., 2000; Rush et al., 2015). In this paper, we focus on abstractive summarization, and especially on abstractive sentence summarization.

Previous work deals with the abstractive sentence summarization task by using either rule-based methods (Dorr et al., 2003), statistical methods utilizing a source-summary parallel corpus to train a machine translation model (Banko et al., 2000), or a syntax-based transduction model (Cohn and Lapata, 2008; Woodsend et al., 2010).

In recent years, the sequence-to-sequence neural framework has become predominant on this task, encoding long source texts and decoding into short summaries with the attention mechanism. RNN is the most commonly adopted and extensively explored architecture (Chopra et al., 2016; Nallapati et al., 2016; Li et al., 2017). A CNN-based architecture is employed by Gehring et al. (2017) in ConvS2S, which applies CNN in both the encoder and the decoder. Later, Wang et al. (2018) build upon ConvS2S with topic word embedding and encoding, and train the system with reinforcement learning.

The most related work to our contrastive attention mechanism is in the field of computer vision. Song et al. (2018a) first propose the contrastive attention mechanism for person re-identification. In their work, based on a pre-provided person and background segmentation, the two regions are contrastively attended so that they can be easily discriminated. In comparison, we apply the contrastive attention mechanism to sentence-level summarization by contrastively attending to relevant parts and irrelevant or less relevant parts. Furthermore, we propose a novel softmax-softmin functionality to train the attention mechanism, which differs from Song et al. (2018a), who use a mean squared error loss for attention training.

Other explorations with respect to the characteristics of the abstractive summarization task include the copying mechanism that copies words from source sequences for composing summaries (Gu et al., 2016; Gulcehre et al., 2016; Song et al., 2018b), the selection mechanism that elaborately selects important parts of source sentences (Zhou et al., 2017; Lin et al., 2018), the distraction mechanism that avoids repeated attention on the same area (Chen et al., 2016), and sequence-level training that avoids the exposure bias of teacher-forcing methods (Ayana et al., 2016; Li et al., 2018; Edunov et al., 2018). Such methods are built on conventional attention, and are orthogonal to our proposed contrastive attention mechanism.

3 Approach

We use two categories of attention for summary generation. One is the conventional attention that attends to relevant parts of the source sentence; the other is the opponent attention that contrarily attends to irrelevant or less relevant parts. Both categories of attention output probability distributions over summary words, which are jointly optimized by encouraging the contribution from the conventional attention and discouraging the contribution from the opponent attention.

Figure 1 illustrates the overall networks. We use the Transformer architecture as our basis, upon which we build the contrastive attention mechanism. The left part is the original Transformer. We derive the opponent attention from the conventional attention, which is the encoder-decoder attention of the original Transformer, and stack several layers on top of the opponent attention, as shown in the right part of Figure 1. Both parts contribute to the summary generation by producing probability distributions over the target vocabulary. The left part outputs the conventional probability based on the conventional attention, as the original Transformer does, while the right part outputs the opponent probability based on the opponent attention.


[Figure 1 diagram. Left: the original Transformer (source/target embeddings, N× stacked layers of self-attention, encoder-decoder attention, and feed-forward sublayers with add & norm, followed by a linear layer and softmax producing the conventional probabilities). Right: the contrastive attention branch (opponent attention derived from the selected attention head, norm, feed-forward with add & norm, linear layer, and softmin producing the opponent probabilities).]

Figure 1: Overall networks. The left part is the original Transformer. The right part, which takes the opponent attention as its bottom layer, fulfils the contrastive attention mechanism.

The two probabilities in Figure 1 are jointly optimized in a novel way, as explained in Section 3.3.

3.1 Transformer for Abstractive Sentence Summarization

Transformer is an attention-based sequence-to-sequence architecture (Vaswani et al., 2017), which encodes the source text into hidden vectors and decodes the target text based on the source-side information and the target generation history. In comparison to RNN-based and CNN-based architectures, both the encoder and the decoder of Transformer adopt attention as the main function.

Let X and Y denote the source sentence and its summary, respectively. Transformer is trained to maximize the probability of Y given X, $\prod_i P_c(y_i \mid y_1^{i-1}, X)$, where P_c(y_i | y_1^{i-1}, X) is the conventional probability of the current summary word y_i given the source sentence and the summary generation history. P_c is computed based on the attention mechanism and the stacked deep layers, as shown in the left part of Figure 1.

Attention Mechanism

Scaled dot-product attention is applied in Transformer:

\[ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \tag{1} \]

where Q, K, and V denote the query vector, the key vectors, and the value vectors, respectively, and d_k denotes the dimension of one vector of K. The softmax function outputs the attention weights distributed over V. attention(Q, K, V) is a weighted sum of the elements of V, and represents the current context information.
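For concreteness, the following is a minimal NumPy sketch of Equation (1) for a single decoder query attending over the encoder outputs; the function and variable names are illustrative and do not come from the released implementation.

```python
import numpy as np

def scaled_dot_product_attention(q, K, V):
    """q: (d_k,) query for one decoder position; K, V: (n, d_k) encoder outputs."""
    d_k = K.shape[-1]
    scores = q @ K.T / np.sqrt(d_k)        # one score per source position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()      # softmax over the source positions
    return weights @ V, weights            # context vector and attention weights
```

The returned weights correspond to the attention distribution over source positions that the contrastive mechanism later manipulates.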

We focus on the encoder-decoder attention, which builds the connection between source and target by informing the decoder which area of the source text should be attended to. Specifically, in the encoder-decoder attention, Q is the single vector coming from the current position of the decoder, while K and V are the same sequence of vectors that are the outcomes of the encoder at all source positions. The softmax function distributes the attention weights over the source positions.

The attention in Transformer adopts the multi-head implementation, in which each head computes attention as in Equation (1) but with smaller Q, K, V whose dimensions are 1/h of the original. The attentions from the h heads are concatenated together and linearly projected to compose the final attention. In this way, multi-head attention provides a multi-view of attention behavior that benefits the final performance.

Deep Layers

The "N×" plates in Figure 1 stand for the stacked N identical layers. On the source side, each of the stacked N layers contains two sublayers: the self-attention mechanism and the fully connected feed-forward network. Each sublayer employs a residual connection that adds the input to the outcome of the sublayer; layer normalization is then applied to the outcome of the residual connection.

On the target summary side, each layer contains an additional sublayer of encoder-decoder attention between the self-attention sublayer and the feed-forward sublayer. At the top of the decoder, the softmax layer is applied to convert the decoder output to summary word generation probabilities.
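The residual-plus-normalization pattern shared by every sublayer can be sketched as follows; the parameter-free layer_norm is a simplification for illustration (the actual LayerNorm has a learned gain and bias).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified LayerNorm without learned gain/bias, for illustration only.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sublayer(x, f):
    # Residual connection (add the input to the sublayer output f(x)),
    # followed by layer normalization, as described above.
    return layer_norm(x + f(x))
```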

3.2 Contrastive Attention Mechanism

3.2.1 Opponent Attention

As illustrated in Figure 1, the opponent attention is derived from the conventional encoder-decoder attention. Since multi-head attention is employed in Transformer, there are N×h heads in total in the conventional encoder-decoder attention, where N denotes the number of layers and h denotes the number of heads in each layer.



Figure 2: Heatmaps of two sampled heads from the conventional encoder-decoder attention. (a) is of the fifth head of the third layer, and (b) is of the fifth head of the first layer.

These heads exhibit diverse attention behaviors, which poses the challenge of determining from which head to derive the opponent attention so that it attends to irrelevant or less relevant parts.

Figure 2 illustrates the attention weights of two sampled heads. The attention weights in (a) reflect well the word-level relevant relation between the source sentence and the target summary, while the attention weights in (b) do not. We find that this behavior characteristic of each head is fixed. For example, head (a) always exhibits the relevant relation across different sentences and different runs. Based on depicting heatmaps of all heads for a few sentences, we choose the head that corresponds well to the relevant relation between source and target to derive the opponent attention [1].

[1] Given manual alignments between source and target of sampled sentence-summary pairs, we select the head whose attention weights have the lowest alignment error rate (AER).

Specifically, let α_c denote the conventional encoder-decoder attention weights of the head which is used for deriving the opponent attention:

\[ \alpha_c = \mathrm{softmax}\left( \frac{q k^{T}}{\sqrt{d_k}} \right) \tag{2} \]

where q and k are the query and key vectors of that head. Let α_o denote the opponent attention weights. It is obtained by applying the opponent function to α_c, followed by the softmax function:

\[ \alpha_o = \mathrm{softmax}(\mathrm{opponent}(\alpha_c)) \tag{3} \]

The opponent function in Equation (3) performs a masking operation: it finds the maximum weight in α_c and replaces it with negative infinity, so that the softmax function outputs zero for that position. The maximum weight in α_c is thus set to zero in α_o after the opponent and softmax functions. In this way, the most relevant part of the source sequence, which receives the maximum attention in the conventional attention weights α_c, is masked and neglected in α_o. Instead, the remaining less relevant or irrelevant parts are extracted into α_o for the following contrastive training and decoding.
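As a concrete illustration of the masking opponent function in Equations (2)-(3), the following NumPy sketch masks the maximum weight and renormalizes; it is a reading of the description above, not the released code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def opponent_attention_weights(alpha_c):
    """alpha_c: (n,) conventional attention weights of the selected head."""
    masked = alpha_c.copy()
    masked[np.argmax(alpha_c)] = -np.inf   # mask the most-attended source position
    return softmax(masked)                 # its weight becomes exactly zero in alpha_o

# Example: the position holding 0.6 is zeroed out and the rest renormalized.
alpha_c = np.array([0.6, 0.3, 0.1])
alpha_o = opponent_attention_weights(alpha_c)   # ~[0.00, 0.55, 0.45]
```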

We also tried other methods to calculate the opponent attention weights, such as α_o = softmax(1 − α_c) (Song et al., 2018a) [2] or α_o = softmax(1/α_c), which aim to make α_o contrary to α_c, but they underperform the masking opponent function on all benchmark datasets. So we present only the masking opponent in the following sections.

[2] Song et al. (2018a) directly let α_o = 1 − α_c when extracting background features for person re-identification in computer vision. We have to add the softmax function since the attention weights must be normalized to sum to one in the sequence-to-sequence framework.

After α_o is obtained via Equation (3), the opponent attention is attention_o = α_o v, where v is the value vector from the same head as the q and k used in computing α_c.

Compared to the conventional attention attention_c, which summarizes the current relevant context, attention_o summarizes the current irrelevant or less relevant context. They constitute a contrastive pair, and contribute together to the final summary word generation.

3.2.2 Opponent Probability

The opponent probability P_o(y_i | y_1^{i-1}, X) is computed by stacking several layers on top of attention_o, with a softmin layer at the end, as shown in the right part of Figure 1. In particular,

\[ z_1 = \mathrm{LayerNorm}(\mathrm{attention}_o) \tag{4} \]
\[ z_2 = \mathrm{FeedForward}(z_1) \tag{5} \]
\[ z_3 = \mathrm{LayerNorm}(z_1 + z_2) \tag{6} \]
\[ P_o(y_i \mid y_1^{i-1}, X) = \mathrm{softmin}(W z_3) \tag{7} \]

where W is the matrix of the linear projection sublayer. attention_o contributes to P_o via Equations (4)-(7) step by step. The LayerNorm and FeedForward layers with residual connection are similar to those of the original Transformer, while a novel softmin function is introduced at the end to invert the contribution from attention_o:

\[ \mathrm{softmin}(v_i) = \frac{e^{-v_i}}{\sum_j e^{-v_j}} \tag{8} \]

where v = W z_3, i.e., the input vector to the softmin function in Equation (7). Softmin normalizes v so that the scores of all words in the summary vocabulary sum to one. We can see that the bigger v_i is, the smaller P_{o,i} is.

Softmin functions contrarily to softmax. As a result, when we try to maximize P_o(y_i = y | y_1^{i-1}, X), where y is the gold summary word, we effectively search for an attention_o that generates the lowest v_g, where g is the index of y in v. This means that the more irrelevant attention_o is to the summary, the lower v_g can be, resulting in a higher P_o.
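A schematic NumPy version of Equations (4)-(8) is given below; the parameter-free LayerNorm, the ReLU feed-forward, and the weight shapes are simplifying assumptions for illustration, not the released implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified LayerNorm without learned parameters.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmin(v):
    e = np.exp(-(v - v.min()))   # shift for numerical stability; ratios are unchanged
    return e / e.sum()           # larger scores receive smaller probability

def opponent_probability(attention_o, W1, W2, W):
    """attention_o: (d,) opponent context; W1: (d, d_ff); W2: (d_ff, d); W: (V, d)."""
    z1 = layer_norm(attention_o)            # Eq. (4)
    z2 = np.maximum(0.0, z1 @ W1) @ W2      # Eq. (5): feed-forward (ReLU assumed)
    z3 = layer_norm(z1 + z2)                # Eq. (6): residual connection + norm
    return softmin(W @ z3)                  # Eq. (7): distribution over the vocabulary
```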

3.3 Training and Decoding

During training, we jointly maximize the conventional probability P_c and the opponent probability P_o:

\[ J = \log P_c(y_i \mid y_1^{i-1}, X) + \lambda \log P_o(y_i \mid y_1^{i-1}, X) \tag{9} \]

where λ is the balancing weight. The conventional probability is computed as in the original Transformer, based on the sublayers of feed-forward, linear projection, and softmax stacked over the conventional attention, as illustrated in the left part of Figure 1. The opponent probability is based on similar sublayers stacked over the opponent attention, but with softmin as the last sublayer, as illustrated in the right part of Figure 1.

Due to the contrary properties of softmax and softmin, jointly maximizing P_c and P_o actually maximizes the contribution from the conventional attention for summary word generation, while at the same time minimizing the contribution from the opponent attention [3]. In other words, the training objective is to let the relevant parts attended to by attention_c contribute more to the summarization, while letting the irrelevant or less relevant parts attended to by attention_o contribute less.

[3] We also tried replacing softmin in Equation (7) with softmax, and correspondingly setting the training objective to maximize J = log(P_c) − λ log(P_o), but this method failed to train because P_o becomes too small during training, and the resulting negative infinity value of log(P_o) hampers the training. In comparison, softmin and the training objective of Equation (9) do not have such a problem, enabling effective training of the proposed network.
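Under these definitions, the per-token objective of Equation (9) can be sketched as below; p_c and p_o stand for the two output distributions of Figure 1, and lam is the balancing weight λ. Summing this quantity over summary positions gives the sentence-level objective.

```python
import numpy as np

def token_objective(p_c, p_o, gold_index, lam):
    """p_c, p_o: (V,) conventional and opponent distributions; gold_index: id of the gold word."""
    # Equation (9): both log-probabilities of the gold word are maximized jointly.
    # Because P_o passes through softmin, raising it drives the opponent branch's
    # score for the gold word down, i.e. makes the opponent context less predictive of it.
    return np.log(p_c[gold_index]) + lam * np.log(p_o[gold_index])
```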

During decoding, we aim to find the maximum J of Equation (9) in the beam search process.

4 Experiments

We conduct experiments on abstractive sentence summarization benchmark datasets to demonstrate the effectiveness of the proposed contrastive attention mechanism.

4.1 Datasets

In this paper, we evaluate our proposed method on three abstractive text summarization benchmark datasets. First, we use the annotated Gigaword corpus and preprocess it identically to Rush et al. (2015), which results in around 3.8M training samples, 190K validation samples, and 1951 test samples for evaluation. The source-summary pairs are formed by pairing the first sentence of each article with its headline. We use DUC2004 as another English dataset, only for testing in our experiments. It contains 500 documents, each with four human-generated reference summaries. The length of the summary is capped at 75 bytes. The last dataset we use is the Large-Scale Chinese Short Text Summarization dataset (LCSTS) (Hu et al., 2015), which is collected from the Chinese microblogging website Sina Weibo. We follow the data split of the original paper, with 2.4M source-summary pairs from the first part of the corpus for training and 725 pairs with high annotation scores from the last part for testing.

4.2 Experimental Setup

We employ Transformer as our base architecture [4]. Six layers are stacked in both the encoder and the decoder, and the dimensions of the embedding vectors and all hidden vectors are set to 512. The inner layer of the feed-forward sublayer has a dimensionality of 2048. We use eight heads in the multi-head attention. The source embedding, the target embedding, and the linear sublayer are shared in our experiments. Byte-pair encoding is employed in the English experiments with a shared source-target vocabulary of about 32k tokens (Sennrich et al., 2015).

[4] https://github.com/pytorch/fairseq
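The hyperparameters reported above can be collected into a small configuration sketch; the key names below are illustrative only and are not actual fairseq options.

```python
# Hyperparameters reported in Section 4.2; key names are illustrative only.
transformer_config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "model_dim": 512,           # embedding and hidden vector size
    "feed_forward_dim": 2048,   # inner feed-forward layer
    "attention_heads": 8,
    "share_embeddings": True,   # source embedding, target embedding, linear sublayer shared
    "bpe_vocab_size": 32000,    # shared source-target BPE vocabulary (English experiments)
}
```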

Regarding the contrastive attention mechanism, the opponent attention is derived from the head whose attention is most synchronous to the word alignments of the source-summary pair. In our experiments, we select the fifth head of the third layer for deriving the opponent attention in the English experiments, and the second head of the third layer in the Chinese experiments. All dimensions in the contrastive architecture are set to 64. The λ in Equation (9) is tuned on the development set in each experiment.


System                                        Gigaword                DUC2004
                                              R-1    R-2    R-L      R-1    R-2    R-L
ABS (Rush et al., 2015)                       29.55  11.32  26.42    26.55   7.06  22.05
ABS+ (Rush et al., 2015)                      29.76  11.88  26.96    28.18   8.49  23.81
RAS-Elman (Chopra et al., 2016)               33.78  15.97  31.15    28.97   8.26  24.06
words-lvt5k-1sent (Nallapati et al., 2016)    35.30  16.64  32.62    28.61   9.42  25.24
SEASS-beam (Zhou et al., 2017)                36.15  17.54  33.63    29.21   9.56  25.51
RNN-MRT (Ayana et al., 2016)                  36.54  16.59  33.44    30.41  10.87  26.79
Actor-Critic (Li et al., 2018)                36.05  17.35  33.49    29.41   9.84  25.85
StructuredLoss (Edunov et al., 2018)          36.70  17.88  34.29    -      -      -
DRGD (Li et al., 2017)                        36.27  17.57  33.62    31.79  10.75  27.48
ConvS2S (Gehring et al., 2017)                35.88  17.48  33.29    30.44  10.84  26.90
ConvS2S-ReinforceTopic (Wang et al., 2018)    36.92  18.29  34.58    31.15  10.85  27.68
FactAware (Cao et al., 2018)                  37.27  17.65  34.24    -      -      -
Transformer                                   37.87  18.69  35.22    31.38  10.89  27.18
Transformer+ContrastiveAttention              38.72  19.09  35.82    32.22  11.04  27.59

Table 1: ROUGE scores on the English evaluation sets of both Gigaword and DUC2004. On Gigaword, the full-length F-1 based ROUGE scores are reported. On DUC2004, the recall-based ROUGE scores are reported. "-" denotes no score is available in that work.


During training, we use the Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 10^-9. The initial learning rate is 0.0005, and the inverse square root schedule is applied for initial warm-up and annealing (Vaswani et al., 2017). We use a dropout rate of 0.3 on all datasets.
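One common formulation of the inverse square root schedule with linear warm-up (following Vaswani et al., 2017) is sketched below; the warm-up step count is an assumed example value, not one reported in the paper.

```python
def inverse_sqrt_lr(step, peak_lr=0.0005, warmup_steps=4000):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step).
    warmup_steps=4000 is an assumed illustrative value."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warm-up phase
    return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)  # inverse square root decay
```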

For evaluation, we employ ROUGE (Lin, 2004) as our metric. Since the standard ROUGE package is designed to evaluate English summarization systems, we follow the method of Hu et al. (2015) and map Chinese words into numerical IDs in order to evaluate performance on the Chinese dataset.
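A minimal sketch of this ID mapping (following Hu et al., 2015) is shown below; the exact mapping scheme, including whether it operates on characters or words, is an assumption for illustration.

```python
def map_to_ids(texts):
    """Map each distinct token to a numeric ID so that the standard English-oriented
    ROUGE toolkit can score the ID sequences instead of raw Chinese text."""
    vocab = {}
    mapped = []
    for text in texts:
        tokens = [vocab.setdefault(ch, str(len(vocab))) for ch in text]
        mapped.append(" ".join(tokens))
    return mapped

# System outputs and references must share the same mapping before scoring.
print(map_to_ids(["今天天气好", "今天天气很好"]))  # ['0 1 1 2 3', '0 1 1 2 4 3']
```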

4.3 Results

4.3.1 English Results

The experimental results on the English evaluation sets are listed in Table 1. We report the full-length F-1 scores of ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) on the evaluation set of the annotated Gigaword corpus, and report the recall-based R-1, R-2, and R-L scores on the DUC2004 evaluation set, following the settings of previous works.

Our results are shown at the bottom of Table 1. The performances of the related works are reported in the upper part of Table 1 for comparison. ABS and ABS+ are the pioneering works on using neural models for abstractive text summarization. RAS-Elman extends ABS/ABS+ with an attentive CNN encoder. words-lvt5k-1sent uses a large vocabulary and linguistic features such as POS and NER tags. RNN-MRT, Actor-Critic, and StructuredLoss are sequence-level training methods that overcome the problems of the usual teacher-forcing methods. DRGD uses a recurrent latent random model to improve summarization quality. FactAware generates summary words conditioned on both the source text and fact descriptions extracted from OpenIE or dependencies. Besides the above RNN-based related works, the CNN-based architectures ConvS2S and ConvS2S-ReinforceTopic are included for comparison.

Table 1 shows that Transformer alone builds a strong baseline, which obtains state-of-the-art performance on the Gigaword evaluation set and performance comparable to the state-of-the-art on DUC2004. When we introduce the contrastive attention mechanism into Transformer, it significantly improves the performance of Transformer and greatly advances the state-of-the-art on both the Gigaword evaluation set and DUC2004, as shown in the row "Transformer+ContrastiveAttention".

4.3.2 Chinese Results

Table 2 presents the evaluation results on LCSTS. The upper rows list the performances of the related works; the bottom rows list the performances of our Transformer baseline and of Transformer with the contrastive attention mechanism integrated. We only take character sequences as source-summary pairs and evaluate the performance based on reference characters for strict comparison with the related works.


System                               R-1     R-2     R-L
RNN context (Hu et al., 2015)        29.90   17.40   27.20
CopyNet (Gu et al., 2016)            34.40   21.60   31.30
RNN-MRT (Ayana et al., 2016)         38.20   25.20   35.40
RNN-distraction (Chen et al., 2016)  35.20   22.60   32.50
DRGD (Li et al., 2017)               36.99   24.15   34.21
Actor-Critic (Li et al., 2018)       37.51   24.68   35.02
Global (Lin et al., 2018)            39.40   26.90   36.50
Transformer                          41.93   28.28   38.32
Transformer+ContrastiveAttention     44.35   30.65   40.58

Table 2: The full-length F-1 based ROUGE scores on the Chinese evaluation set of LCSTS.


Table 2 shows that Transformer also sets a strong baseline on LCSTS that surpasses the performances of the previous works. When Transformer is equipped with our proposed contrastive attention mechanism, the performance is significantly improved, drastically advancing the state-of-the-art on LCSTS.

5 Analysis and Discussion

5.1 Effect of the Contrastive Attention Mechanism on Attentions

Figure 3 shows the attention weights before and after using the contrastive attention mechanism. We depict the averaged attention weights of all heads in one layer in Figures 3a and 3b to study how they contribute to the conventional probability computation, and depict the opponent attention weights in Figure 3c to study their contribution to the opponent probability. Since we select the fifth head of the third layer to derive the opponent attention in the English experiments, the studies are carried out on the third layer.

Figure 3a is from the baseline Transformer, and Figure 3b is from "Transformer+ContrastiveAttention". We can see that "Transformer+ContrastiveAttention" is more focused on the source parts that are most relevant to the summary than the baseline Transformer, which scatters attention weights on neighbors of summary words or even on functional words such as "-lrb-" and "the". "Transformer+ContrastiveAttention" cancels such scattered attentions by using the contrastive attention mechanism.


Figure 3: The attention weight changes from using the contrastive attention mechanism. (a) shows the average attention weights of the third layer of the baseline Transformer, (b) shows those of "Transformer+ContrastiveAttention", and (c) shows the opponent attention derived from the fifth head of the third layer.

Figure 3c depicts the opponent attention weights. They are optimized during training to generate the lowest score, which is fed into softmin to obtain the highest opponent probability P_o. The more irrelevant the opponent is to the summary word, the lower the score that can be obtained, thus resulting in a higher P_o. Figure 3c shows that the attentions are formed over irrelevant parts with varied weights as a result of maximizing P_o during training.

5.2 Effect of the Opponent Probability in Decoding

We study the contribution of the opponent probability P_o by dropping it during decoding to see whether this hurts the performance. Table 4 shows that dropping P_o significantly harms the performance of "Transformer+ContrastiveAtt". The performance difference between the model dropping P_o and the baseline Transformer is marginal, indicating that adding the opponent probability P_o is key to achieving the performance improvement.

5.3 Explorations on Deriving the Opponent Attention

Masking More Attention Weights for Deriving the Opponent Attention


System                    Gigaword                DUC2004
                          R-1    R-2    R-L      R-1    R-2    R-L
mask maximum weight       38.72  19.09  35.82    32.22  11.04  27.59
mask top-2 weights        38.17  19.15  35.51    31.87  10.94  27.41
mask top-3 weights        38.36  19.11  35.56    31.67  10.37  27.31
dynamically mask          38.12  18.92  35.28    31.37  10.32  27.11
synchronous head          38.72  19.09  35.82    32.22  11.04  27.59
non-synchronous head      37.85  18.59  35.16    31.73  10.74  27.35
averaged head             38.43  19.10  35.53    31.82  10.98  27.43
Transformer baseline      37.87  18.69  35.22    31.38  10.89  27.18

Table 3: Results of explorations on the opponent attention derivation. The upper part presents the influence of masking more attention weights for deriving the opponent attention. The middle part presents the results of selecting different heads for the opponent attention derivation. The bottom row presents the result of Transformer.

Gigaword                        R-1     R-2     R-L
Transformer                     37.87   18.69   35.22
Transformer+ContrastiveAtt-Po   37.92   18.88   35.21
Transformer+ContrastiveAtt      38.72   19.09   35.82

DUC2004                         R-1     R-2     R-L
Transformer                     31.38   10.89   27.18
Transformer+ContrastiveAtt-Po   31.21   10.70   26.85
Transformer+ContrastiveAtt      32.22   11.04   27.59

Table 4: The effect of dropping P_o (denoted by -Po) from Transformer+ContrastiveAtt during decoding.

In Section 3.2.1, we mask the most salient word, which has the maximum weight in α_c, to derive the opponent attention. In this subsection, we experiment with masking more weights of α_c in two ways: 1) masking the top-k weights, and 2) dynamic masking. In the dynamic masking method, we first order the weights from big to small, then go on masking two neighbors until the ratio between them exceeds a threshold. The threshold of 1.02 is set by training and tuning on the development set.
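The dynamic masking variant can be sketched as follows; this is one illustrative reading of the description above (keep masking down the sorted weights while neighboring weights stay within the 1.02 ratio), not the released code.

```python
import numpy as np

def dynamic_mask(alpha_c, threshold=1.02):
    """Return a boolean array marking the source positions to mask in the opponent attention."""
    order = np.argsort(-alpha_c)            # positions sorted by descending weight
    mask = np.zeros(len(alpha_c), dtype=bool)
    mask[order[0]] = True                   # always mask the maximum weight
    for i in range(len(order) - 1):
        ratio = alpha_c[order[i]] / alpha_c[order[i + 1]]
        if ratio > threshold:               # stop at the first clear drop between neighbors
            break
        mask[order[i + 1]] = True           # neighbors are nearly equal: mask the next one too
    return mask
```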

The upper rows of Table 3 present the performance comparison between masking the maximum weight and masking more weights. They show that masking only the maximum weight performs better, indicating that masking the most salient weight leaves more irrelevant or less relevant words with which to compute the opponent probability P_o, which is more reliable than that computed from the fewer words remaining after masking more weights.

Selecting Non-synchronous Head or Averaged Head for Deriving the Opponent Attention

As explained in Section 3.2.1, the opponent attention is derived from the head that is most synchronous to the word alignments between the source sentence and the summary. We denote it the "synchronous head". We also explored deriving the opponent attention from the fifth head of the first layer, which is non-synchronous to the word alignments, as illustrated in Figure 2b. Its result is presented in the "non-synchronous head" row. In addition, the attention weights averaged over all heads of the third layer are used to derive the opponent attention. We denote this the "averaged head".

As shown in the middle part of Table 3, both "non-synchronous head" and "averaged head" underperform "synchronous head". "Non-synchronous head" performs worst, and is even worse than the Transformer baseline on Gigaword. This indicates that it is better to compose the opponent attention from irrelevant parts, which can be easily located in the synchronous head. "Averaged head" performs slightly worse than "synchronous head", and is also slower since all heads are involved.

5.4 Qualitative Study

Table 5 shows the qualitative results. The highlights in the baseline Transformer outputs mark the incorrect content extracted by the baseline system. In contrast, the highlights in the Transformer+ContrastiveAtt outputs show that correct content is extracted, since the contrastive system distinguishes relevant parts from irrelevant parts on the source side and makes attending to the correct areas easier.

6 Conclusion

We proposed a contrastive attention mechanism for abstractive sentence summarization, using both the conventional attention that attends to the relevant parts of the source sentence and a novel opponent attention that attends to irrelevant or less relevant parts for summary word generation. The two categories of attention constitute a contrastive pair, and we encourage the contribution from the conventional attention and penalize the contribution from the opponent attention through joint training. Using Transformer as a strong baseline, experiments on three benchmark datasets show that the proposed contrastive attention mechanism significantly improves performance, advancing the state-of-the-art for the task.


Src: press freedom in algeria remains at risk despite the release on wednesday of prominent newspaper editor mohamed UNK after a two-year prison sentence , human rights organizations said .
Ref: algerian press freedom at risk despite editor 's release UNK picture
Transformer: press freedom remains at risk in algeria rights groups say
Transformer+ContrastiveAtt: press freedom remains at risk despite release of algerian editor

Src: denmark 's poul-erik hoyer completed his hat-trick of men 's singles badminton titles at the european championships , winning the final here on saturday
Ref: hoyer wins singles title
Transformer: hoyer completes hat-trick
Transformer+ContrastiveAtt: hoyer wins men 's singles title

Src: french bank credit agricole launched on tuesday a public cash offer to buy the ## percent of emporiki bank it does not already own , in a bid valuing the greek group at #.# billion euros ( #.# billion dollars ) .
Ref: credit agricole announces #.#-billion-euro bid for greek bank emporiki
Transformer: credit agricole launches public cash offer for greek bank
Transformer+ContrastiveAtt: french bank credit agricole bids #.# billion euros for greek bank

Table 5: Example summaries generated by the baseline Transformer and Transformer+ContrastiveAtt.


Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments. This work was supported by the National Key R&D Program of China (Grant No. 2016YFE0132100) and the National Natural Science Foundation of China (Grant Nos. 61525205, 61673289).

References

Ayana, Shiqi Shen, Yu Zhao, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with sentence-wise optimization. Computing Research Repository, arXiv:1604.01904. Version 2.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations.

Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 318–325.

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.

Qian Chen, Xiao Dan Zhu, Zhen Hua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2754–2760.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 137–144.

Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop - Volume 5, pages 1–8.

Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 355–364.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1243–1252.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Baotian Hu, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale Chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1967–1972.

Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 178–185.

Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization - step one: Sentence compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 703–710.

Piji Li, Lidong Bing, and Wai Lam. 2018. Actor-critic based training framework for abstractive summarization. Computing Research Repository, arXiv:1803.11070.

Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017. Deep recurrent generative decoder for abstractive text summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2091–2100.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL-04 Workshop on Text Summarization Branches Out.

Junyang Lin, Xu Sun, Shuming Ma, and Qi Su. 2018. Global encoding for abstractive summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 163–169.

Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290.

Joel Larocca Neto, Alex A. Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization using a machine learning approach. In Brazilian Symposium on Artificial Intelligence, pages 205–215.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. 2018a. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1179–1188.

Kaiqiang Song, Lin Zhao, and Fei Liu. 2018b. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1717–1729.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. 2018. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4453–4460.

Kristian Woodsend, Yansong Feng, and Mirella Lapata. 2010. Title generation with quasi-synchronous grammar. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 513–523.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1095–1104.