
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 9360–9369, November 16–20, 2020. ©2020 Association for Computational Linguistics


VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

Mingzhe Li 1,2,*, Xiuying Chen 1,2,*, Shen Gao 2, Zhangming Chan 1,2, Dongyan Zhao 1,2 and Rui Yan 1,2,†

1 Center for Data Science, AAIS, Peking University, Beijing, China
2 Wangxuan Institute of Computer Technology, Peking University, Beijing, China
{li mingzhe,xy-chen,shengao,zhaody,ruiyan}@pku.edu.cn

Abstract

A popular multimedia news format nowadays provides users with a lively video and a corresponding news article, and is employed by influential news media including CNN and BBC, and by social media including Twitter and Weibo. In such a case, automatically choosing a proper cover frame of the video and generating an appropriate textual summary of the article can help editors save time and readers make decisions more effectively. Hence, in this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO) to tackle such a problem. The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article. To this end, we propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator. In the dual interaction module, we propose a conditional self-attention mechanism that captures local semantic information within the video and a global-attention mechanism that handles the semantic relationship between news text and video from a high level. Extensive experiments conducted on a large-scale real-world VMSMO dataset1 show that DIMS achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.

1 Introduction

Existing experiments (Li et al., 2017) have proven that multimodal news can significantly improve users' sense of satisfaction with informativeness.

* Equal contribution. Ordering is decided by a coin flip.
† Corresponding author.

1 https://github.com/yingtaomj/VMSMO

Figure 1: An example of video-based multimodal summarization with multimodal output.

As one of these multimedia data forms, introducing news events with video and textual descriptions is becoming increasingly popular, and has been employed as the main form of news reporting by news media including BBC, Weibo, CNN, and Daily Mail. An illustration is shown in Figure 1, where the news contains a video with a cover picture and a full news article with a short textual summary. In such a case, automatically generating multimodal summaries, i.e., choosing a proper cover frame of the video and generating an appropriate textual summary of the article, can help editors save time and readers make decisions more effectively.

There are several works focusing on multimodal summarization. The most related work to ours is that of Zhu et al. (2018), who propose the task of generating a textual summary and picking the most representative picture from 6 input candidates. However, in real-world applications, the input is usually a video consisting of hundreds of frames. Consequently, the temporal dependency in a video cannot be simply modeled by static encoding methods. Hence, in this work, we propose a novel task, Video-based Multimodal Summarization with Multimodal Output (VMSMO), which selects a cover frame from the news video and generates a textual summary of the news article at the same time.


The cover image of the video should present the salient point of the whole video, while the textual summary should extract the important information from the source article. Since the video and the article focus on the same event with the same report content, these two information formats complement each other in the summarizing process. However, how to fully explore the relationship between the temporal dependency of frames in the video and the semantic meaning of the article remains a problem, since the video and the article come from two different spaces.

Hence, in this paper, we propose a model named Dual-Interaction-based Multimodal Summarizer (DIMS), which learns to summarize the article and the video simultaneously by conducting a dual interaction strategy in the process. Specifically, we first employ recurrent neural networks (RNNs) to encode the text and the video; this encoding RNN captures the spatial and temporal dependencies between images in the video. Next, we design a dual interaction module to let the video and text fully interact with each other. Specifically, we propose a conditional self-attention mechanism which learns local video representations under the guidance of the article, and a global-attention mechanism which learns high-level representations of the video-aware article and the article-aware video. Last, the multimodal generator generates the textual summary and extracts the cover image based on the fused representation from the last step. To evaluate the performance of our model, we collect the first large-scale news article-summary dataset associated with video covers from social media websites. Extensive experiments on this dataset show that DIMS significantly outperforms state-of-the-art baseline methods on commonly-used metrics by a large margin.

To summarize, our contributions are threefold:
• We propose a novel Video-based Multimodal Summarization with Multimodal Output (VMSMO) task, which chooses a proper cover frame for the video and generates an appropriate textual summary of the article.
• We propose a Dual-Interaction-based Multimodal Summarizer (DIMS) model, which jointly models the temporal dependency of the video with the semantic meaning of the article, and generates the textual summary and the video cover simultaneously.
• We construct a large-scale dataset for VMSMO, and experimental results demonstrate that our model outperforms other baselines in terms of both automatic and human evaluations.

2 Related Work

Our research builds on previous work in three fields: text summarization, multimodal summarization, and visual question answering.

Text Summarization. Our proposed task is based on text summarization, whose methods can be divided into extractive and abstractive ones (Gao et al., 2020b). Extractive models (Zhang et al., 2018; Narayan et al., 2018; Chen et al., 2018; Luo et al., 2019; Xiao and Carenini, 2019) directly pick sentences from the article and regard their aggregate as the summary. In contrast, abstractive models (Sutskever et al., 2014; See et al., 2017; Wenbo et al., 2019; Gui et al., 2019; Gao et al., 2019a; Chen et al., 2019a; Gao et al., 2019b) generate a summary from scratch, and the abstractive summaries are typically less redundant.

Multimodal Summarization. A series of works (Li et al., 2017, 2018; Palaskar et al., 2019; Chan et al., 2019; Chen et al., 2019b; Gao et al., 2020a) focused on generating better textual summaries with the help of multimodal input. Multimodal summarization with multimodal output is relatively less explored. Zhu et al. (2018) proposed to jointly generate a textual summary and select the most relevant image from 6 candidates. Following their work, Zhu et al. (2020) added a multimodal objective function that combines the loss from textual summary generation and image selection. However, in real-world applications, we usually need to choose the cover frame for a continuous video consisting of hundreds of frames. Consequently, the temporal dependency between frames in a video cannot be simply modeled by static encoding methods.

Visual Question Answering. The Visual Question Answering (VQA) task is similar to our task in taking images and a corresponding text as input. Most works consider VQA as a classification problem, where the understanding of image sub-regions or image recognition becomes particularly important (Goyal et al., 2017; Malinowski et al., 2015; Wu et al., 2016; Xiong et al., 2016). As for interaction models, one of the state-of-the-art VQA models (Li et al., 2019) proposed positional self-attention with a co-attention mechanism, which is faster than a recurrent neural network (RNN).


Figure 2: Overview of DIMS. We divide our model into three parts: (1) Feature Encoder encodes the input article and video separately; (2) Dual Interaction Module learns fused representations of video and article at different levels; (3) Multi-Generator generates the textual summary and chooses the video cover simultaneously.

Guo et al. (2019) devised an image-question-answer synergistic network, where candidate answers are coarsely scored according to their relevance to the image and question pair, and answers with a high probability of being correct are re-ranked by synergizing with the image and question.

3 Problem Formulation

Before presenting our approach for VMSMO, we first introduce the notations and key concepts. For an input news article $X = \{x_1, x_2, \ldots, x_{T_d}\}$ with $T_d$ words, we assume there is a ground-truth textual summary $Y = \{y_1, y_2, \ldots, y_{T_y}\}$ with $T_y$ words. Meanwhile, there is a news video $V$ corresponding to the article, and we assume there is a ground-truth cover picture $C$ that extracts the most important frame from the video content. For a given article $X$ and the corresponding video $V$, our model emphasizes the salient parts of both inputs by conducting deep interaction. The goal is to generate a textual summary $Y'$ that successfully grasps the main points of the article and to choose a frame picture $C'$ that covers the gist of the video.
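As a concrete illustration of this formulation, the sketch below shows how one training sample might be represented in code; the class and helper names are hypothetical and not part of the released dataset or codebase.

```python
# Illustrative sketch (not the authors' code): one VMSMO sample pairs an
# article with its reference summary and a video with its reference cover.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class VMSMOSample:
    article_tokens: List[str]   # X = {x_1, ..., x_Td}
    summary_tokens: List[str]   # Y = {y_1, ..., y_Ty} (ground truth)
    frames: np.ndarray          # video V, shape (num_frames, H, W, 3)
    cover_index: int            # index of the ground-truth cover frame C


def split_into_segments(frames: np.ndarray, frames_per_segment: int) -> List[np.ndarray]:
    """Equally divide the frame sequence into segments, as the video encoder later assumes."""
    n = len(frames) // frames_per_segment * frames_per_segment
    return [frames[i:i + frames_per_segment] for i in range(0, n, frames_per_segment)]
```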

4 Model

4.1 Overview

In this section, we propose our Dual-Interaction-based Multimodal Summarizer (DIMS), which can be divided into three parts, as shown in Figure 2:
• Feature Encoder is composed of a text encoder and a video encoder, which encode the input article and video separately.
• Dual Interaction Module conducts deep interaction, including a conditional self-attention and a global-attention mechanism, between video segments and the article to learn different levels of representation of the two inputs.
• Multi-Generator generates the textual summary and chooses the video cover by incorporating the fused information.

4.2 Feature Encoder

4.2.1 Text Encoder

To model the semantic meaning of the input news text $X = \{x_1, x_2, \ldots, x_{T_d}\}$, we first use a word embedding matrix $e$ to map the one-hot representation of each word $x_i$ into a high-dimensional vector space. Then, to encode contextual information from these embedding representations, we use a bi-directional recurrent neural network (Bi-RNN) (Hochreiter and Schmidhuber, 1997) to model the temporal interactions between words:

$h^x_t = \text{Bi-RNN}_X(e(x_t), h^x_{t-1})$,  (1)

where $h^x_t$ denotes the hidden state of the $t$-th step of the Bi-RNN for $X$. Following See et al. (2017) and Ma et al. (2018), we choose the long short-term memory (LSTM) as the Bi-RNN cell.
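To make the encoder concrete, here is a minimal PyTorch sketch of Eq. (1); the paper's released implementation is in TensorFlow, so the module names, dimensions, and framework choice here are our own illustrative assumptions.

```python
# A minimal sketch of the Bi-LSTM text encoder in Eq. (1).
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # e(.)
        self.birnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)                  # Bi-RNN_X

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, T_d) -> hidden states h^x_t: (batch, T_d, 2 * hidden_dim)
        emb = self.embedding(token_ids)
        states, _ = self.birnn(emb)
        return states
```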

4.2.2 Video Encoder

A news video usually lasts several minutes and consists of hundreds of frames. Intuitively, a video can be divided into several segments, each of which corresponds to different content. Hence, we choose to encode the video hierarchically. More specifically, we equally divide the frames in the video into several segments and employ a low-level frame encoder and a high-level segment encoder to learn hierarchical representations.


Figure 3: Conditional self-attention module, which captures local semantic information within video segments under the guidance of the article representation.

Frame encoder. We utilize the Resnet-v1 model (He et al., 2016) to encode frames, which alleviates gradient vanishing and reduces computational cost:

$O^i_j = \text{Resnet-v1}(m^i_j)$,  (2)

$M^i_j = \text{relu}(F_v(O^i_j))$,  (3)

where $m^i_j$ is the $j$-th frame in the $i$-th segment and $F_v(\cdot)$ is a linear transformation function.

Segment encoder. As mentioned before, it is important to model the continuity of images in the video, which cannot be captured by a static encoding strategy. We employ an RNN as the segment encoder due to its superiority in exploiting the temporal dependency among frames (Zhao et al., 2017):

$S^i_j = \text{Bi-RNN}_S(M^i_j, S^i_{j-1})$.  (4)

$S^i_j$ denotes the hidden state of the $j$-th step of the Bi-RNN for segment $s_i$, and the final hidden state $S^i_{T_f}$ denotes the overall representation of segment $s_i$, where $T_f$ is the number of frames in a segment.

4.3 Dual Interaction Module

The cover image of the video should contain the key point of the whole video, while the textual summary should extract the important information from the source article. Hence, these two information formats complement each other in the summarizing process. In this section, we conduct a deep interaction between the video and the article to jointly model the temporal dependency of the video and the semantic meaning of the text. The module consists of a conditional self-attention mechanism that captures local semantic information within video segments and a global-attention mechanism that handles the semantic relationship between the news text and the video from a high level.

Conditional self-attention mechanism. Traditional self-attention can be used to obtain contextual video representations due to its flexibility in relating two elements in a distance-agnostic manner. However, as illustrated in Xie et al. (2020), semantic understanding often relies on more complicated dependencies than pairwise ones, especially conditional dependency upon a given premise. Hence, in the VMSMO task, we capture the local semantic information of the video conditioned on the input text information.

Our conditional self-attention module, shown in Figure 3, is composed of a stack of N identical layers and a conditional layer. The identical layers learn to encode local video segments, while the conditional layer learns to assign high weights to video segments conditioned on their relationship to the article. We first use a fully-connected layer to project each segment representation $S^i_{T_f}$ into a query $Q^i$, key $K^i$, and value $V^i$. The scaled dot-product self-attention is then defined as:

$\alpha_{i,j} = \frac{\exp(Q^i K^j)}{\sum_{n=1}^{T_s} \exp(Q^i K^n)}$,  (5)

$S^i = \frac{\sum_{j=1}^{T_s} \alpha_{i,j} V^j}{\sqrt{d}}$,  (6)

where $d$ stands for the hidden dimension and $T_s$ is the number of segments in a video. $S^i$ is then fed into the feed-forward sub-layer, which includes a residual connection (He et al., 2016) and layer normalization (Ba et al., 2016).

Next, we highlight the salient part of the video under the guidance of the article. Taking the article information $h^x_{T_d}$ as the condition, the attention score on each original segment representation $S^i_{T_f}$ is calculated as:

$\beta_i = \sigma(F_s(S^i_{T_f} h^x_{T_d}))$.  (7)

The final conditional segment representation $S^c_i$ is denoted as $\beta_i S^i$.

Global-attention mechanism. The global-attention module grounds the article representation on the video segments and fuses the information of the article into the video, which results in an article-aware video representation and a video-aware article representation. Formally, we utilize a two-way attention mechanism to obtain the co-attention between the encoded text representation $h^x_t$ and the encoded segment representation $S^i_{T_f}$:

$E^t_i = F_h(h^x_t)\,(F_t(S^i_{T_f}))^T$.  (8)

We use $E^t_i$ to denote the attention weight on the $t$-th word by the $i$-th video segment. To learn the alignments between text and segment information, the global representations of the video-aware article $\hat{h}^x_t$ and the article-aware video $\hat{S}^c_i$ are computed as:

$\hat{h}^x_t = \sum_{i=1}^{T_s} E^t_i S^i_{T_f}$,  (9)

$\hat{S}^c_i = \sum_{t=1}^{T_d} (E^t_i)^T h^x_t$.  (10)

4.4 Multi-Generator

In the VMSMO task, the multi-generator module not only needs to generate the textual summary but also needs to choose the video cover.

Textual summary generation. For the first task, we use the final state of the input text representation $h^x_{T_d}$ as the initial state $d_0$ of the RNN decoder, and the $t$-th generation step is:

$d_t = \text{LSTM}_{dec}(d_{t-1}, [e(y_{t-1}); h^c_{t-1}])$,  (11)

where $d_t$ is the hidden state of the $t$-th decoding step and $h^c_{t-1}$ is the context vector calculated by the standard attention mechanism (Bahdanau et al., 2014), introduced below.

To take advantage of the article representation $h^x_t$ and the video-aware article representation $\hat{h}^x_t$, we apply an "editing gate" $\gamma_e$ to decide how much information from each side should be focused on:

$\gamma_e = \sigma(F_d(d_t))$,  (12)

$g_i = \gamma_e h^x_i + (1 - \gamma_e)\hat{h}^x_i$.  (13)

Then the context vector $h^c_t$ is calculated as:

$\delta^i_t = \frac{\exp(F_a(g_i, d_t))}{\sum_j \exp(F_a(g_j, d_t))}$,  (14)

$h^c_t = \sum_i \delta^i_t g_i$.  (15)

Finally, the context vector $h^c_t$ is concatenated with the decoder state $d_t$ and fed into a linear layer to obtain the generated word distribution $P_v$:

$d^o_t = \sigma(F_p([d_t; h^c_t]))$,  (16)

$P_v = \text{softmax}(F_o(d^o_t))$.  (17)

Following See et al. (2017), we also equip our model with a pointer network to handle the out-of-vocabulary problem. The loss of textual summary generation is the negative log-likelihood of the target word $y_t$:

$\mathcal{L}_{seq} = -\sum_{t=1}^{T_y} \log P_v(y_t)$.  (18)
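Below is a hedged PyTorch sketch of a single decoding step with the editing gate (Eqs. 11-17); the pointer mechanism is omitted, and the linear layers standing in for F_d, F_a, F_p, and F_o are illustrative choices, not the authors' exact parameterization.

```python
# One decoder step with the editing gate, for a batch of size 1.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EditingGateDecoderStep(nn.Module):
    def __init__(self, emb_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim, 1)                            # F_d
        self.attn = nn.Linear(2 * hidden_dim, 1)                        # F_a
        self.out = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), # F_p
                                 nn.Sigmoid(),
                                 nn.Linear(hidden_dim, vocab_size))     # F_o

    def forward(self, y_prev_emb, ctx_prev, state, h_x, h_x_video_aware):
        # y_prev_emb: (1, emb_dim); ctx_prev: (1, hidden); state: (d_{t-1}, c_{t-1})
        # h_x, h_x_video_aware: (T_d, hidden) article and video-aware article states
        d_t, c_t = self.cell(torch.cat([y_prev_emb, ctx_prev], dim=-1), state)

        gamma_e = torch.sigmoid(self.gate(d_t))                         # Eq. (12)
        g = gamma_e * h_x + (1 - gamma_e) * h_x_video_aware             # Eq. (13)

        scores = self.attn(torch.cat([g, d_t.expand(g.size(0), -1)], dim=-1))
        delta = F.softmax(scores, dim=0)                                # Eq. (14)
        ctx_t = (delta * g).sum(dim=0, keepdim=True)                    # Eq. (15)

        logits = self.out(torch.cat([d_t, ctx_t], dim=-1))              # Eqs. (16)-(17)
        return F.log_softmax(logits, dim=-1), ctx_t, (d_t, c_t)
```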

Cover frame selector. The cover frame is chosen based on the hierarchical video representations, i.e., the original frame representation $M^i_j$, the conditional segment representation $S^c_i$, and the article-aware segment representation $\hat{S}^c_i$:

$p^i_j = \gamma^1_f S^c_i + \gamma^2_f \hat{S}^c_i + (1 - \gamma^1_f - \gamma^2_f) M^i_j$,  (19)

$y^c_{i,j} = \sigma(F_c(p^i_j))$,  (20)

where $y^c_{i,j}$ is the matching score of the candidate frame. The fusion gates $\gamma^1_f$ and $\gamma^2_f$ are determined by the last text encoder hidden state $h^x_{T_d}$:

$\gamma^1_f = \sigma(F_m(h^x_{T_d}))$,  (21)

$\gamma^2_f = \sigma(F_n(h^x_{T_d}))$.  (22)

We use a pairwise hinge loss to measure the selection accuracy:

$\mathcal{L}_{pic} = \sum_N \max(0, y^c_{negative} - y^c_{positive} + \text{margin})$,  (23)

where $y^c_{negative}$ and $y^c_{positive}$ correspond to the matching scores of the negative samples and the ground-truth frame, respectively, and margin is the rescale margin of the hinge loss.

The overall loss of the model is:

$\mathcal{L} = \mathcal{L}_{seq} + \mathcal{L}_{pic}$.  (24)
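The cover scorer and the two losses can be sketched as follows; the fusion gates, the scoring vector standing in for F_c, and the margin value are passed in directly for brevity, whereas the full model derives the gates from the last text encoder state as in Eqs. (21)-(22).

```python
# Cover-frame scoring and loss sketch (Eqs. 19-24).
import torch


def cover_scores(frame_feats, seg_cond, seg_article_aware, gamma1, gamma2, w_c):
    # frame_feats M^i_j: (T_f, d); seg_cond S^c_i and seg_article_aware \hat{S}^c_i: (d,)
    p = gamma1 * seg_cond + gamma2 * seg_article_aware \
        + (1 - gamma1 - gamma2) * frame_feats                      # Eq. (19)
    return torch.sigmoid(p @ w_c)                                  # Eq. (20), one score per frame


def hinge_loss(score_positive, scores_negative, margin=0.1):
    # Eq. (23): pairwise hinge loss between the ground-truth frame and negative candidates
    return torch.clamp(scores_negative - score_positive + margin, min=0).sum()


# Eq. (24): the total loss adds the summary NLL and the cover selection loss, e.g.
# loss = loss_seq + hinge_loss(pos_score, neg_scores)
```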

5 Experimental Setup

5.1 Dataset

To the best of our knowledge, there is no existing large-scale dataset for the VMSMO task. Hence, we collect the first large-scale dataset for VMSMO from Weibo, the largest social network website in China. Most of China's mainstream media have Weibo accounts, and they publish the latest news in their accounts with lively videos and articles. Correspondingly, each sample in our data contains an article with a textual summary and a video with a cover picture. The average video duration is one minute and the frame rate is 25 fps. For the text part, the average length of an article is 96.84 words and the average length of a textual summary is 11.19 words. Overall, there are 184,920 samples in the dataset, which is split into a training set of 180,000 samples, a validation set of 2,460 samples, and a test set of 2,460 samples.

5.2 Comparisons

We compare our proposed method against textual summarization baselines and VQA baselines.

Traditional textual summarization baselines:
Lead: selects the first sentence of the article as the textual summary (Nallapati et al., 2017).
TextRank: a graph-based extractive summarizer which adds sentences as nodes and uses edges to weight similarity (Mihalcea and Tarau, 2004).
PG: a sequence-to-sequence framework combined with an attention mechanism and a pointer network (See et al., 2017).
Unified: a model which combines the strengths of extractive and abstractive summarization (Hsu et al., 2018).
GPG: Shen et al. (2019) proposed to generate the textual summary by "editing" pointed tokens instead of hard copying.

Multimodal baselines:
How2: a model proposed to generate textual summaries with video information (Palaskar et al., 2019).
Synergistic: an image-question-answer synergistic network that values the role of the answer for precise visual dialog (Guo et al., 2019).
PSAC: a model adding positional self-attention with co-attention for the VQA task (Li et al., 2019).
MSMO: the first model on the multimodal-output task, which pays attention to text and images while generating the textual summary and uses coverage to help select the picture (Zhu et al., 2018).
MOF: a model based on MSMO which adds consideration of image accuracy as another loss (Zhu et al., 2020).

5.3 Evaluation Metrics

The quality of the generated textual summary is evaluated by standard full-length Rouge F1 (Lin, 2004), following previous work (See et al., 2017; Chen et al., 2018).

                              R-1    R-2    R-L
Extractive summarization
Lead                          16.2   5.3    13.9
TextRank                      13.7   4.0    12.5
Abstractive summarization
PG (See et al., 2017)         19.4   6.8    17.4
Unified (Hsu et al., 2018)    23.0   6.0    20.9
GPG (Shen et al., 2019)       20.1   4.5    17.3
Our model
DIMS                          25.1   9.6    23.2

Table 1: Rouge scores compared with traditional textual summarization baselines.

R-1, R-2, and R-L refer to unigram, bigram, and longest common subsequence overlap, respectively. The quality of the chosen cover frame is evaluated by mean average precision (MAP) (Zhou et al., 2018) and recall at position ($R_n@k$) (Tao et al., 2019). $R_n@k$ measures whether the positive sample is ranked in the top $k$ positions among $n$ candidates.
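For reference, the ranking metrics for cover selection can be computed as in the sketch below; with a single positive candidate per video, average precision reduces to the reciprocal rank of that candidate. The function names are ours, not part of any released evaluation script.

```python
# Illustrative implementations of Rn@k and MAP for one positive candidate per video.
from typing import List


def recall_at_k(scores: List[float], positive_idx: int, k: int) -> float:
    # 1.0 if the positive candidate is ranked in the top k, else 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if positive_idx in ranked[:k] else 0.0


def mean_average_precision(all_scores: List[List[float]], positives: List[int]) -> float:
    ap = []
    for scores, pos in zip(all_scores, positives):
        # rank of the positive candidate (1-based); AP with one relevant item is 1/rank
        rank = 1 + sum(s > scores[pos] for i, s in enumerate(scores) if i != pos)
        ap.append(1.0 / rank)
    return sum(ap) / len(ap)
```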

5.4 Implementation Details

We implement our experiments in TensorFlow (Abadi et al., 2016) on an NVIDIA GTX 1080 Ti GPU. The code for our model is available online2. For all models, we set the word embedding dimension and the hidden dimension to 128. The encoding step is set to 100, while the minimum decoding step is 10 and the maximum step is 30. For video preprocessing, we extract one of every 120 frames to obtain 10 frames as cover candidates. All candidates are resized to 128x64. We regard the frame that has the maximum cosine similarity with the ground-truth cover as the positive sample, and the others as negative samples. Note that the average cosine similarity of positive samples is 0.90, which is a high score, demonstrating the high quality of the constructed candidates. In the conditional self-attention mechanism, the number of stacked layers is set to 2. For hierarchical encoding, each segment contains 5 frames. Experiments are performed with a batch size of 16. All parameters in our model are initialized from a Gaussian distribution. During training, we use the Adagrad optimizer and apply gradient clipping to the range [-2, 2]. The vocabulary size is limited to 50k. For testing, we use beam search with beam size 4 and decode until an end-of-sequence token is reached. We select the 5 best checkpoints based on performance on the validation set and report averaged results on the test set.

2 https://github.com/yingtaomj/VMSMO
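The candidate construction described above can be sketched as follows; comparing flattened pixels with cosine similarity is an assumption for illustration, and the resizing to 128x64 (done with an image library in the real pipeline) is omitted.

```python
# Sketch of cover-candidate construction: sample 1 of every 120 frames and mark the
# candidate most similar to the ground-truth cover as the positive sample.
import numpy as np


def build_cover_candidates(frames: np.ndarray, gt_cover: np.ndarray,
                           step: int = 120, num_candidates: int = 10):
    # frames: (N, H, W, 3); gt_cover: (H, W, 3) at the same resolution (assumed here)
    candidates = frames[::step][:num_candidates]

    def cos(a: np.ndarray, b: np.ndarray) -> float:
        a, b = a.ravel().astype(float), b.ravel().astype(float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    sims = [cos(c, gt_cover) for c in candidates]
    positive_idx = int(np.argmax(sims))          # remaining candidates are negatives
    return candidates, positive_idx
```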


                                   R-1    R-2    R-L    MAP    R10@1  R10@2  R10@5
Video-based summarization
How2 (Palaskar et al., 2019)       21.7   6.1    19.0   -      -      -      -
Visual Q&A methods
Synergistic (Guo et al., 2019)     -      -      -      0.588  0.444  0.557  0.759
PSAC (Li et al., 2019)             -      -      -      0.524  0.363  0.481  0.730
Multimodal summarization with multimodal output
MSMO (Zhu et al., 2018)            20.1   4.6    17.3   0.554  0.361  0.551  0.820
MOF (Zhu et al., 2020)             21.3   5.7    17.9   0.615  0.455  0.615  0.817
Our models
DIMS                               25.1   9.6    23.2   0.654  0.524  0.634  0.824
DIMS-textual summary               22.0   6.3    19.2   -      -      -      -
DIMS-cover frame                   -      -      -      0.611  0.449  0.610  0.823
Ablation study
DIMS-G                             23.7   7.4    21.7   0.624  0.471  0.619  0.819
DIMS-S                             24.4   8.9    22.5   0.404  0.204  0.364  0.634

Table 2: Rouge and accuracy scores compared with multimodal baselines.

6 Experimental Results

6.1 Overall Performance

We first examine whether our DIMS outperforms the other baselines, as listed in Table 1 and Table 2. Firstly, abstractive models outperform all extractive methods, demonstrating that our proposed dataset is suitable for abstractive summarization. Secondly, the video-enhanced models outperform traditional textual summarization models, indicating that video information helps summary generation. Finally, our model outperforms MOF by 17.8%, 68.4%, and 29.6% in terms of Rouge-1, Rouge-2, and Rouge-L, and by 6.3% and 15.2% in MAP and R10@1, respectively, which proves the superiority of our model. All our Rouge scores have a 95% confidence interval of at most ±0.55 as reported by the official Rouge script.

In addition to automatic evaluation, system performance was also evaluated on the generated textual summaries by human judgments on 70 randomly selected cases, similar to Liu and Lapata (2019). Our first evaluation study quantified the degree to which summarization models retain key information from the articles, following a question-answering (QA) paradigm (Narayan et al., 2018). A set of questions was created based on the gold summary. Then we examined whether participants were able to answer these questions by reading the system summaries alone. We created 183 questions in total, with two to three questions per gold summary. Correct answers were marked 1 and incorrect answers 0. The average over all question scores is taken as the system score.

          QA (%)   Rating
How2      46.2     -0.24
MOF       51.3     -0.14
Unified   53.8     0.00
DIMS      66.7     0.38

Table 3: System scores based on questions answered by humans and summary quality ratings.

Our second evaluation estimated the overall quality of the textual summaries by asking participants to rank them according to Informativeness (does the summary convey important content about the topic in question?), Coherence (is the summary fluent and grammatical?), and Succinctness (does the summary avoid repetition?). Participants were presented with the gold summary and the summaries generated by the systems that performed better on automatic metrics, and were asked to decide which was the best and which was the worst. The rating of each system was calculated as the percentage of times it was chosen as best minus the percentage of times it was selected as worst, ranging from -1 (worst) to 1 (best).

Both evaluations were conducted by three highly educated native-speaker annotators. Participants evaluated summaries produced by Unified, How2, MOF, and our DIMS, all of which achieved high performance in automatic evaluations. As shown in Table 3, in both evaluations participants overwhelmingly prefer our model. All pairwise comparisons among systems are statistically significant using a paired Student's t-test at α = 0.01.


Figure 4: Visualization of the global-attention matrix between the news article and two frames in the same video.

6.2 Ablation Study

Next, we conduct ablation tests to assess the importance of the conditional self-attention mechanism (-S) and the global-attention mechanism (-G) in Table 2. All ablation models perform worse than DIMS in terms of all metrics, which demonstrates the preeminence of DIMS. Specifically, the global-attention module contributes most to textual summary generation, while the conditional self-attention module is more important for choosing the cover frame.

6.3 Analysis of Multi-task learning

Our model aims to generate a textual summary and choose the cover frame at the same time, which can be regarded as multi-task learning. Hence, in this section, we examine whether these two tasks complement each other. We separate our model into two single-task architectures, named DIMS-textual summary and DIMS-cover frame, which generate the textual summary and choose the video cover frame, respectively. The results are shown in Table 2. It can be seen that the multi-task DIMS outperforms the single-task DIMS-textual summary and DIMS-cover frame, improving summarization performance by 20.8% in terms of ROUGE-L and increasing the accuracy of cover selection by 7.0% in MAP.

6.4 Visualization of dual interaction module

To study the multimodal interaction module, we visualize the global-attention matrix $E^t_i$ in Equation 8 on one randomly sampled case, as shown in Figure 4. In this case, we show the attention on the article words for two representative frames in the video. The darker the color, the higher the attention weight. It can be seen that for the left frame, the phrase "hand in hand" has a higher weight than "picture", while for the right frame, the phrase "Book Fair" has the highest weight. This corresponds to the fact that the main body of the left frame is two old men, and the right frame is about reading books.

Article: On August 26, in Shaanxi Ankang, a 12-year-old junior girl Yu Taoxin goose-stepped like a parade during the military training in the new semester, and won thousands of praises. Yu Taoxin said that her father was a veteran, and she worked hard in military training because of the influence of her father. Her father told her that military training should be as strict as in the army. 8月26日,陕西安康,12岁的初一女生余陶鑫,在新学期军训期间,她踢出阅兵式般的标准步伐,获千万点赞。余陶鑫说,爸爸是名退伍军人,军训刻苦是因为受到爸爸影响,爸爸曾告诉她,军训时就应和在部队里一样,严格要求自己。

Reference summary: A 12-year-old girl goose-stepped like a parade during the military training, "My father is a veteran." 12岁女孩军训走出阅兵式步伐,"爸爸是退伍军人"

QA: What did the 12-year-old girl do? [She goose-stepped like a parade.] 这个12岁女孩做了什么?[她走出阅兵式步伐。] Why did she do this? [She was influenced by her father.] 她为什么这样做?[她受到爸爸的影响。]

Unified: 12-year-old girl Yu Taoxin goose-stepped during military training. 12岁女生余陶鑫军训期间阅兵式般的标准步伐

How2: 12-year-old girls were organized military training, and veteran mother parade. 12岁女生组团军训,退伍军人妈妈阅兵式

MOF: A 12-year-old junior citizen [unk]: father gave a kicked like. 1名12岁初一市民[unk]:爸爸踢式点赞

DIMS: A 12-year-old junior girl goose-stepped like a parade: My father is a veteran, and military training should be as strict as in the army. 12岁初一女生踢出阅兵式:爸爸是名退伍军人,军训时就应和在部队一样

(The original table also shows the selected cover frames: ground truth, MOF, and DIMS.)

Table 4: Examples of the summaries generated by baselines and DIMS.

We show a case study in Table 4, which includes the input article and the summaries generated by different models. We also show the question-answer pairs used in the human evaluation and the chosen cover. The results show that the summary generated by our model is both fluent and accurate, and the chosen cover frame is also similar to the ground-truth frame.


7 Conclusion

In this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO), which chooses a proper video cover and generates an appropriate textual summary for a video-attached article. We propose a model named Dual-Interaction-based Multimodal Summarizer (DIMS), which includes a local conditional self-attention mechanism and a global-attention mechanism to jointly model and summarize the multimodal input. Our model achieves state-of-the-art results in terms of automatic metrics and outperforms baselines in human evaluations by a large margin. In the near future, we aim to incorporate the video script information into the multimodal summarization process.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments. This work was supported by the National Key Research and Development Program of China (No. 2020AAA0105200) and the National Science Foundation of China (NSFC No. 61876196, No. 61672058). Rui Yan is partially supported as a Young Fellow of the Beijing Institute of Artificial Intelligence (BAAI).

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. ICLR.
Zhangming Chan, Xiuying Chen, Yongliang Wang, Juntao Li, Zhiqiang Zhang, Kun Gai, Dongyan Zhao, and Rui Yan. 2019. Stick to the facts: Learning towards a fidelity-oriented e-commerce product description generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4960–4969.
Xiuying Chen, Zhangming Chan, Shen Gao, Meng-Hsuan Yu, Dongyan Zhao, and Rui Yan. 2019a. Learning towards abstractive timeline summarization. In IJCAI, pages 4939–4945.
Xiuying Chen, Shen Gao, Chongyang Tao, Yan Song, Dongyan Zhao, and Rui Yan. 2018. Iterative document representation learning towards summarization with polishing. In EMNLP, pages 4088–4097.
Xiuying Chen, Daorui Xiao, Shen Gao, Guojun Liu, Wei Lin, Bo Zheng, Dongyan Zhao, and Rui Yan. 2019b. RPM-oriented query rewriting framework for e-commerce keyword-based sponsored search. arXiv preprint arXiv:1910.12527.
Shen Gao, Xiuying Chen, Piji Li, Zhangming Chan, Dongyan Zhao, and Rui Yan. 2019a. How to write summaries with patterns? Learning towards abstractive summarization through prototype editing. arXiv preprint arXiv:1909.08837.
Shen Gao, Xiuying Chen, Piji Li, Zhaochun Ren, Lidong Bing, Dongyan Zhao, and Rui Yan. 2019b. Abstractive text summarization by incorporating reader comments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6399–6406.
Shen Gao, Xiuying Chen, Chang Liu, Li Liu, Dongyan Zhao, and Rui Yan. 2020a. Learning to respond with stickers: A framework of unifying multi-modality in multi-turn dialog. In Proceedings of The Web Conference 2020, pages 1138–1148.
Shen Gao, Xiuying Chen, Zhaochun Ren, Dongyan Zhao, and Rui Yan. 2020b. From standard summarization to new tasks and beyond: Summarization with manifold information. In IJCAI.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.
Min Gui, Junfeng Tian, Rui Wang, and Zhenglu Yang. 2019. Attention optimization for abstractive document summarization. In EMNLP/IJCNLP.
Dalu Guo, Chang Xu, and Dacheng Tao. 2019. Image-question-answer synergistic network for visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10434–10443.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR, pages 770–778.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.
Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. ACL.
Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang, Chengqing Zong, et al. 2018. Multi-modal sentence summarization with modality attention and image filtering. IJCAI-PRICAI.
Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, Chengqing Zong, et al. 2017. Multi-modal summarization for asynchronous collection of text, image, audio and video. EMNLP/IJCNLP.
Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, and Chuang Gan. 2019. Beyond RNNs: Positional self-attention with co-attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8658–8665.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. ACL.
Ling Luo, Xiang Ao, Y. S. Song, Feiyang Pan, Min Yang, and Qing He. 2019. Reading like HER: Human reading inspired extractive summarization. In EMNLP/IJCNLP.
Shuming Ma, Xu Sun, Junyang Lin, and Xuancheng Ren. 2018. A hierarchical end-to-end model for jointly improving text summarization and sentiment classification. IJCAI-PRICAI.
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–9.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In EMNLP.
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In NAACL-HLT.
Shruti Palaskar, Jindřich Libovický, Spandana Gella, and Florian Metze. 2019. Multimodal abstractive summarization for How2 videos. ACL.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
Xiaoyu Shen, Yang Zhao, Hui Su, and Dietrich Klakow. 2019. Improving latent alignment in text summarization by generalizing the pointer generator. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3753–3764.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in NIPS.
Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and Rui Yan. 2019. Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. WSDM, pages 267–275.
Wang Wenbo, Yang Gao, Huang Heyan, and Yuxiang Zhou. 2019. Concept pointer network for abstractive summarization. In EMNLP/IJCNLP.
Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. 2016. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630.
Wen Xiao and Giuseppe Carenini. 2019. Extractive summarization of long documents by combining global and local context. In EMNLP/IJCNLP.
Yujia Xie, Tianyi Zhou, Yi Mao, and Weizhu Chen. 2020. Conditional self-attention for query-based summarization. arXiv preprint arXiv:2002.07338.
Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning, pages 2397–2406.
Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018. Neural latent extractive document summarization. EMNLP/IJCNLP.
Bin Zhao, Xuelong Li, and Xiaoqiang Lu. 2017. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM International Conference on Multimedia, pages 863–871.
Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. ACL, 1:1118–1127.
Junnan Zhu, Haoran Li, Tianshang Liu, Yu Zhou, Jiajun Zhang, Chengqing Zong, et al. 2018. MSMO: Multimodal summarization with multimodal output. EMNLP/IJCNLP.
Junnan Zhu, Yu Zhou, Jiajun Zhang, Haoran Li, Chengqing Zong, and Changliang Li. 2020. Multimodal summarization with guidance of multimodal reference. In AAAI.