
Summary Transfer: Exemplar-based Subset Selection for Video Summarization

Ke Zhang∗, Wei-Lun Chao∗

University of Southern California, Los Angeles, CA 90089

{zhang.ke,weilunc}@usc.edu

Fei Sha
University of California, Los Angeles, CA 90095
[email protected]

Kristen Grauman
University of Texas at Austin

Austin, TX 78701
[email protected]

Abstract

Video summarization has unprecedented importance to help us digest, browse, and search today's ever-growing video collections. We propose a novel subset selection technique that leverages supervision in the form of human-created summaries to perform automatic keyframe-based video summarization. The main idea is to nonparametrically transfer summary structures from annotated videos to unseen test videos. We show how to extend our method to exploit semantic side information about the video's category/genre to guide the transfer process by those training videos semantically consistent with the test input. We also show how to generalize our method to subshot-based summarization, which not only reduces computational costs but also provides more flexible ways of defining visual similarity across subshots spanning several frames. We conduct extensive evaluation on several benchmarks and demonstrate promising results, outperforming existing methods in several settings.

1. Introduction

The amount of video data has been explosively increasing due to the proliferation of video recording devices such as mobile phones, wearable and ego-centric cameras, surveillance equipment, and others. According to YouTube statistics, about 300 hours of video are uploaded every minute [2].

∗ Equal contributions

To cope with this video data deluge, automatic video summarization has emerged as a promising tool to assist in curating video contents for fast browsing, retrieval, and understanding [14, 30, 47, 49], without losing important information.

Video can be summarized at several levels of abstraction: keyframes [10, 25, 28, 35, 50], segment or shot-based skims [24, 31, 36, 37], story-boards [6, 9], montages [16, 42] or video synopses [40]. In this paper, we focus on developing learning algorithms for selecting keyframes or subshots from a video sequence. Namely, the input is a video and its subshots and the output is an (ordered) subset of the frames or subshots in the video.

Inherently, summarization is a structured prediction problem where the decisions on whether to include or exclude certain frames or subshots in the subset are interdependent. This is in sharp contrast to typical classification and recognition tasks where the output is a single label.

The structured nature of subset selection presents a major challenge. Current approaches rely heavily on several heuristics to decide the desirability of each frame: representativeness [15, 17, 37], diversity or uniformity [29, 50], interestingness and relevance [16, 25, 31, 33, 37]. However, combining those frame-based properties to output an optimal subset remains an open and understudied problem. In particular, researchers are hampered by the lack of knowledge on the "global" criteria human annotators presumably optimize when manually creating a summary.

Recently, initial steps investigating supervised learning for video summarization have been made. They demonstrate promising results [10, 12], often exceeding the conventional unsupervised clustering of frames.


The main idea is to use a training set of videos and human-created summaries as targets to adapt the parameters of a subset selection model to optimize the quality of the summarization. If successful, a strong form of supervised learning would extract high-level semantics and cues from human-created summaries to guide summarization.

Supervised learning for structured prediction is a challenging problem in itself. Existing parametric techniques typically require a complex model with sufficient annotated data to represent highly complicated decision regions in a combinatorially large output space. In this paper, we explore a nonparametric supervised learning approach to summarization. Our method is motivated by the observation that similar videos share similar summary structures. For instance, suppose we have a collection of videos of wedding ceremonies inside churches. It is quite likely good summaries for those videos would all contain frames portraying brides proceeding to the altar, standing of the grooms and their best men, the priests' preaching, the exchange of rings, etc. Thus, if one such video is annotated with human-created summaries, a clever algorithm could essentially "copy and paste" the relative positions of the extracted frames in the annotated video sequence and apply them to an unannotated one to extract relevant frames. Note that this type of transferring summary structures across videos need not assume a precise matching of visual appearance in corresponding frames — there is no need to have the same priest as long as the frames of the priests in each video are sufficiently different from other frames to be "singled out" as possible candidates.

The main idea of our approach centers around this intuition, that is, non-parametric learning from exemplar videos to transfer summary structures to novel input videos. In recent years, non-parametric methods in the vision literature have shown great promise in letting the data "speak for itself", though thus far primarily for traditional categorization or regression tasks (e.g., label transfer for image recognition [27, 44] or scene completion [13]).

How can summarization be treated non-parametrically? A naive application of non-parametric learning to video summarization would treat keyframe selection as a binary classification problem — matching each frame in the unannotated test video to the nearest human-selected keyframes in some training video, and deciding independently per frame whether it should be included in or excluded from the summary. Such an approach, however, conceptually fails on two fronts.

First, it fails to account for the relatedness between a summary's keyframes. Second, it limits the system to inputs having very similar frame-level matches in the annotated database, creating a data efficiency problem.

Therefore, rather than transfer simple relevance labels, our key insight is to transfer the structures implied by subset selection. We show how kernel-based representations of a video's frames (subshots) can be used to detect and align the meta-cues present in selected subsets. In this way, we compose novel summaries by borrowing recurring structures in exemplars for which we have seen both the source video and its human-created summary. A conceptual diagram of our approach is shown in Fig. 1.

In short, our main contributions are an original modeling idea that leverages non-parametric learning for structured objects (namely, selecting subsets from video sequences), a summarization method that advances the frontier of supervised learning for video summarization, and an extensive empirical study validating the proposed method and attaining far better summarization results than competing methods on several benchmark datasets.

The rest of the paper is organized as follows. In section 3, we describe our approach of nonparametric structure transfer. We report and analyze experimental results in section 4 and conclude in section 5.

2. Related Work

A variety of video summarization techniques have been developed in the literature [34, 45]. Broadly speaking, most methods first compute visual features at the frame level, then apply some selection criteria to prioritize frames for inclusion in the output summary.

Keyframe-based methods select a subset of frames to form a summary, and typically use low-level features like optical flow [46] or image differences [50]. Recent work also injects high-level information such as object tracks [28] or "important" objects [25], or takes user input to generate a storyboard [9]. In contrast, video skimming techniques first segment the input into subshots using shot boundary detection. The summary then consists of a selected set of representative subshots [24, 36, 37].

Selection criteria for summaries often aim to retain diverse and representative frames [15, 17, 29, 37, 50]. Another strategy is to predict object and event saliency [16, 25, 33, 37], or to pose summarization as an anomaly detection problem [19, 51]. When the camera is known to be stationary, background subtraction and object tracking offer valuable cues about the salient entities in the video [5, 40].

Figure 1: The conceptual diagram of our approach, which leverages the intuition that similar videos share similar summary structures. The main idea is nonparametric structure transfer, i.e., transferring the subset structures in the human-created summaries (blue frames) of the training videos to a new video. Concretely, for each new video, we first compute frame-level similarity between training and test videos (i.e., $\mathrm{sim}(\cdot, \cdot)$, cf. eq. (4)). Then, we encode the summary structures in the training videos with kernel matrices made of binarized pairwise similarity among their frames. We combine those structures, factoring in the pairwise similarity between the training and the test videos, into a kernel matrix that encodes the summary structure for the test video, cf. eq. (7). Finally, the summary is decoded by inputting the kernel matrix to a probabilistic model called the determinantal point process (DPP) to extract a globally optimal subset of frames.

Whatever the above choices, existing methods are almost entirely unsupervised. For example, they employ clustering to identify groups of related frames, and/or manually define selection criteria based on intuition for the problem. Some prior work includes supervised learning components (e.g., to generate regions with learned saliency metrics [25], train classifiers for canonical viewpoints [17], or recognize fragments of a particular event category [39]), but they do not learn the subset selection procedure itself.

Departing from unsupervised methods, limited recent work formulates video summarization as a subset selection problem [10, 12, 18, 48]. This enables supervised learning, exploiting knowledge encoded in human-created summaries. In [12], a submodular function optimizes a global objective function of the desirability of selected frames, while [10] uses a probabilistic model that maximizes the probability of the ground-truth subsets.

The novelty of our approach is to learn non-parametrically from exemplar training videos to transfer summary structures to test videos. In contrast to previous parametric models [10, 12], non-parametric learning generalizes to new videos by directly exploiting patterns in the training data. This has the advantage of generalizing locally within highly nonsmooth regions: as long as a test video's "neighborhood" contains an annotated training video, the summary structure of that training video will be transferred. In contrast, parametric techniques typically require a complex model with sufficient annotated data to parametrically represent those regions. Our non-parametric approach also puts design power into flexible kernel functions, as opposed to relying strictly on combinations of hand-crafted criteria (e.g., frame interestingness, diversity, etc.).

3. Approach

We cast the process of extracting a summary from a video as selecting a subset of items (i.e., video frames) from a ground set (i.e., the whole video). Given a corpus of videos and their human-created summaries, our learning algorithm learns the optimal criteria for subset selection and applies them to unseen videos to extract summaries.

The first step is to decide on a subset selection model that can output a structure (i.e., an ordered subset). For such structured prediction problems, we focus on the determinantal point process (DPP) [23], which has the benefit of being more computationally tractable than many probabilistic graphical models [20]. Empirically, the DPP has been successfully applied to document summarization [22], image retrieval [7] and, more recently, to video summarization [3, 10].

We first describe the DPP and how it can be used for video summarization. We then describe our main approach in detail, as well as several of its extensions.

3.1. Background

Let $\mathcal{Y} = \{1, 2, \cdots, N\}$ denote a (ground) set of $N$ items, such as video frames. The ground set has $2^N$ subsets. The DPP over the $N$ items assigns a probability to each of those subsets. Let $y \subseteq \mathcal{Y}$ denote a subset; the probability of selecting it is given by

\[
P(y; L) = \frac{\det(L_y)}{\det(L + I)}, \qquad (1)
\]


where $L$ is a symmetric, positive semidefinite matrix and $I$ is an identity matrix of the same size as $L$. $L_y$ is the principal submatrix of $L$ with rows and columns indexed by the integers in $y$.

The DPP can be seen conceptually as a fully connected $N$-node Markov network where the nodes correspond to the items. This network's node potentials are given by the diagonal elements of $L$ and the edge potentials are given by the off-diagonal elements of $L$. Note that those "potentials" cannot be arbitrarily assigned — to ensure they form a valid probabilistic model, the matrix $L$ needs to be positive semidefinite. Due to this constraint, $L$ is often referred to as a kernel matrix whose elements can be interpreted as measuring pairwise compatibility.

Besides computational tractability, which facilitates parameter estimation, the DPP has an important modeling advantage over standard Markov networks. Due to the celebrated Hammersley-Clifford theorem, Markov networks cannot model distributions where there are zero-probability events. On the other hand, the DPP is capable of assigning zero probability to absolutely impossible (or inadmissible) instantiations of random variables.

To see its use for video summarization, suppose there are two frames that are identical. For keyframe-based summarization, any subset containing such identical frames should be ruled out by being assigned zero probability. This is impossible in Markov networks — no matter how small, Markov networks will assign strictly positive probabilities to an exponentially large number of subsets containing identical frames. For a DPP, since the two items are identical, they lead to two identical columns/rows in the matrix $L$, resulting in a zero determinant (thus zero probability) for those subsets. Thus, the DPP naturally encourages selected items in the subset to be diverse, an important objective for summarization and information retrieval [23].
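To make the determinant computation in eq. (1) concrete, here is a minimal numpy sketch (our own illustration, not code from the authors) that evaluates P(y; L) for a toy linear kernel and confirms that a subset containing two identical frames receives zero probability:

```python
import numpy as np

def dpp_log_prob(L, y):
    """log P(y; L) = log det(L_y) - log det(L + I), cf. eq. (1)."""
    N = L.shape[0]
    Ly = L[np.ix_(y, y)]
    sign, logdet_y = np.linalg.slogdet(Ly)
    _, logdet_norm = np.linalg.slogdet(L + np.eye(N))
    if sign <= 0:
        return -np.inf          # det(L_y) <= 0: the subset has zero probability
    return logdet_y - logdet_norm

# Toy ground set of 4 frames; frames 0 and 1 are identical.
feats = np.array([[1.0, 0.0],
                  [1.0, 0.0],   # duplicate of frame 0
                  [0.0, 1.0],
                  [0.6, 0.8]])
L = feats @ feats.T             # linear summarization kernel, cf. eq. (3)

print(np.exp(dpp_log_prob(L, [0, 2])))  # positive: diverse frames
print(np.exp(dpp_log_prob(L, [0, 1])))  # 0.0: identical frames never co-selected
```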

The mode of the distribution is the most probable subset

\[
y^* = \arg\max_{y} P(y; L) = \arg\max_{y} \det(L_y). \qquad (2)
\]

This is an NP-hard combinatorial optimization problem, and there are several approaches to obtaining approximate solutions [8, 23].
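For completeness, one standard approximation is sketched below; it is not necessarily the inference procedure used here (the paper points to [8, 23] for approximate MAP solvers). It simply adds frames one at a time as long as doing so increases det(L_y):

```python
import numpy as np

def greedy_dpp_map(L):
    """Greedy approximation to eq. (2): grow y while det(L_y) keeps increasing."""
    selected, remaining = [], list(range(L.shape[0]))
    best = 1.0                       # det of the empty principal minor is 1 by convention
    while True:
        best_item, best_val = None, best
        for i in remaining:
            cand = selected + [i]
            val = np.linalg.det(L[np.ix_(cand, cand)])
            if val > best_val:
                best_item, best_val = i, val
        if best_item is None:        # no single addition improves the determinant
            return sorted(selected)
        selected.append(best_item)
        remaining.remove(best_item)
        best = best_val
```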

The most crucial component in a DPP is its kernel matrix $L$. To apply the DPP to video summarization, we define the ground set as the frames in a video and identify the most desired summarization as the MAP inference result of eq. (2). We compute $L$ with a bivariate function over two frames — we dub it the summarization kernel:

\[
L_{ij} = \phi(v_i)^{\mathsf{T}} \phi(v_j), \qquad (3)
\]

where $\phi(\cdot)$ is a function of the features $v_i$ (or $v_j$) computed on the $i$-th (or the $j$-th) frame. There are several choices. For instance, $\phi(\cdot)$ could be an identity function, a nonlinear mapping implied by a Gaussian RBF kernel, or the output of a neural network [10].

As each different video needs to have a different kernel, $\phi(\cdot)$ needs to be identified from a sufficiently rich function space so it generalizes from modeling the training videos to new ones. If the videos are substantially different, this generalization can be challenging, especially when there are not enough annotated videos with human-created summaries. Our approach overcomes this challenge by directly using the training videos' kernel matrices, as described below.
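As a concrete (hypothetical) instantiation of eq. (3), the sketch below builds a summarization kernel from per-frame features using either the identity map or the implicit feature map of a Gaussian RBF kernel; the approach described next avoids learning φ(·) altogether by reusing the training videos' kernels.

```python
import numpy as np

def summarization_kernel(V, kind="linear", sigma=1.0):
    """L_ij = phi(v_i)^T phi(v_j), cf. eq. (3), for an (N, d) feature matrix V."""
    if kind == "linear":             # phi is the identity map
        return V @ V.T
    if kind == "rbf":                # phi is the implicit Gaussian RBF feature map
        sq = np.sum(V ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * V @ V.T
        return np.exp(-d2 / sigma)
    raise ValueError(f"unknown kernel type: {kind}")
```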

3.2. Non-parametric video summarization

Our approach differs significantly from existing summarization methods, including those based on DPPs. Rather than learn a single function $\phi(\cdot)$ and discard the training dataset, we construct $L$ for every unannotated video by comparing it to the annotated training videos. This construction exploits two sources of information: 1) how similar the new video is to annotated ones, and 2) how the training videos are summarized. The former can be inferred directly by comparing visual features at each frame, while the latter can be "read off" from the human-created training summaries.

We motivate our approach with an idealized example that provides useful insight. Let us assume we are given a training set of videos and their summaries $\mathcal{D} = \{(\mathcal{Y}_r, y_r)\}_{r=1}^{R}$ and a new video $\mathcal{Y}$ to be summarized. Suppose this new video is very similar — we define similarity more precisely later — to a particular $\mathcal{Y}_r$ in $\mathcal{D}$. Then we can reasonably assume that the summary $y_r$ might work well for the new video. As a concrete example, consider the case where both $\mathcal{Y}$ and $\mathcal{Y}_r$ are videos of wedding ceremonies inside churches. We anticipate seeing similar events across both videos: brides proceeding to the altars, priests delivering speeches, exchanging rings, etc. Moreover, similarity in their summaries exists on the higher-order structural level: the relative positions of the summary frames of $y_r$ in the sequence $\mathcal{Y}_r$ are an informative prior on where the frames of the summary $y$ should be in the new video $\mathcal{Y}$. Specifically, as long as we can link the test video to the training video by identifying similar frames,$^1$ we can "copy down" — transfer — the positions of $y_r$ and lift the corresponding frames in $\mathcal{Y}$ to generate its summary $y$.

$^1$This task is itself not trivial, of course, but it does have the benefit of a rich literature on image matching and recognition work, including efficient search strategies.


While this intuition is conceptually analogous to the familiar paradigm of nearest-neighbor classification, our approach is significantly different. The foremost difference is that, as discussed in section 1, we cannot select frames independently (by nonparametrically learning their similarity to those in the training videos). We need to transfer summary structures, which encode the interdependencies of selecting frames. Therefore, a naive solution of representing videos with fixed-length descriptors in Euclidean space and literally pretending their summaries are "labels" that can be transferred to new data is flawed.

The main steps of our approach are outlined in Fig. 1. We describe them in detail in the following.

Step 1: Frame-based visual similarity. To infer similarity across videos, we experiment with common measures from the computer vision literature for calculating frame-based similarity from the visual features $v_i$ and $v_k$ extracted from the corresponding frames:

\[
\begin{aligned}
\mathrm{sim}_1(i,k) &= v_i^{\mathsf{T}} v_k \\
\mathrm{sim}_2(i,k) &= \exp\{-\|v_i - v_k\|_2^2 / \sigma\} \\
\mathrm{sim}_3(i,k) &= \exp\{-(v_i - v_k)^{\mathsf{T}} \Omega (v_i - v_k)\},
\end{aligned}
\qquad (4)
\]

where $\sigma$ and $\Omega$ are adjustable parameters (constrained to be positive or positive definite, respectively). These forms of similarity measures are often used in vision tasks and are quite flexible, e.g., one can learn the kernel parameters for $\mathrm{sim}_3$. However, they are not the focus of our approach — we expect more sophisticated ones will only benefit our learning algorithm. We also expect high-level features (such as interestingness, objectness, etc.) could be beneficial. In section 3.4 we discuss a generalization to replace frame-level similarity with subshot-level similarity.
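For reference, a direct transcription of the three measures in eq. (4) (our own sketch; the value sigma = 1 mirrors the setting mentioned in Sec. 3.3, and Omega is any positive-definite matrix):

```python
import numpy as np

def sim1(vi, vk):
    """Inner-product similarity, eq. (4)."""
    return float(vi @ vk)

def sim2(vi, vk, sigma=1.0):
    """Gaussian RBF similarity; sigma = 1 with unit-norm features, as in Sec. 3.3."""
    return float(np.exp(-np.sum((vi - vk) ** 2) / sigma))

def sim3(vi, vk, Omega):
    """Mahalanobis-style similarity with a learnable positive-definite matrix Omega."""
    d = vi - vk
    return float(np.exp(-d @ Omega @ d))
```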

Step 2: Summarization kernels for training videos. The summarization kernels $\{L_r\}_{r=1}^{R}$ are not given to us in the training data. However, note that the crucial function of those kernels is to ensure that, when used to perform the MAP inference in eq. (2) to identify the summary on the training video $\mathcal{Y}_r$, they lead to the correct summarization $y_r$ (which is in the training set). This prompts us to define the following idealized summarization kernels

\[
L_r = \alpha_r
\begin{pmatrix}
\delta(1 \in y_r) & 0 & \cdots & 0 \\
0 & \delta(2 \in y_r) & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \delta(N_r \in y_r)
\end{pmatrix}
\qquad (5)
\]

or more compactly,

\[
L_r = \alpha_r \, \mathrm{diag}\big(\{\delta(n \in y_r)\}_{n=1}^{N_r}\big), \qquad (6)
\]

where $\mathrm{diag}$ turns a vector into a diagonal matrix, $N_r$ is the number of frames in $\mathcal{Y}_r$, and $\alpha_r > 1$ is an adjustable parameter. The structure of $L_r$ is intuitive: if a frame is in the summary $y_r$, then its corresponding diagonal element is $\alpha_r$, otherwise 0. It is easy to verify that $L_r$ indeed gives rise to the correct summarization. Note that $\alpha_r > 1$ is required. If $\alpha_r = 1$, any subset of $y_r$ is a solution to the MAP inference problem (and we will be getting a shorter summarization). If $\alpha_r < 1$, the empty set would be the summary (as the determinant of an empty matrix is 1, by convention).
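A minimal sketch of eq. (6), assuming frames are indexed from 0 and summary_indices lists the human-selected keyframes of training video r (alpha = 2 is an arbitrary illustrative value satisfying the requirement alpha_r > 1):

```python
import numpy as np

def idealized_kernel(num_frames, summary_indices, alpha=2.0):
    """L_r = alpha_r * diag({delta(n in y_r)}), eq. (6); requires alpha_r > 1."""
    delta = np.zeros(num_frames)
    delta[list(summary_indices)] = 1.0
    return alpha * np.diag(delta)

# A 6-frame training video whose human-created summary is frames {1, 4}:
# MAP inference under this kernel recovers exactly that subset.
L_r = idealized_kernel(6, [1, 4])
```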

Step 3: Transfer summary structure. Our aim is now to transfer the structures encoded by the idealized summarization kernels from the training videos to a new (test) video $\mathcal{Y}$. To this end, we synthesize $L$ for the new video $\mathcal{Y}$ out of $\{L_r\}$.

Let $i$ and $j$ index the video frames in $\mathcal{Y}$, with $k$ and $l$ for a specific training video $\mathcal{Y}_r$. Specifically, the "contribution" from $\mathcal{Y}_r$ to $L$ is given by

\[
L^{r}_{ij} = \sum_{k} \sum_{l} \mathrm{sim}_r(i,k) \, \mathrm{sim}_r(j,l) \, L_{r,kl}, \qquad (7)
\]

where $L_{r,kl}$ is an element of $L_r$, and $\mathrm{sim}_r(\cdot, \cdot)$ measures frame-based (visual) similarity between frames of $\mathcal{Y}$ and $\mathcal{Y}_r$.

Fig. 1 illustrates graphically how frame-based similarity enables transfer of structures in training summaries. We gain further insight by examining the case when the frame-based similarity $\mathrm{sim}_r(\cdot, \cdot)$ is sharply peaked — namely, there are very good matches between specific pairs of frames (an assumption likely satisfied in the running example of summarizing wedding ceremony videos):

\[
\mathrm{sim}_r(i,m) \gg \mathrm{sim}_r(i,k), \ \forall\, k \neq m, \qquad
\mathrm{sim}_r(j,n) \gg \mathrm{sim}_r(j,l), \ \forall\, l \neq n. \qquad (8)
\]

Under these conditions,

\[
L^{r}_{ij} \approx \mathrm{sim}_r(i,m) \, \mathrm{sim}_r(j,n) \, L_{r,mn}. \qquad (9)
\]

Intuitively, if $\mathcal{Y}$ and $\mathcal{Y}_r$ are precisely the same video (and the video frames in $\mathcal{Y}_r$ are sufficiently visually different), then the matrix $L$ would be very similar to $L_r$. Consequently, the summarization $y_r$, computed from $L_r$, would be a good summary for $\mathcal{Y}$.


To include all the information in the training data, we sum up the contributions from all $\mathcal{Y}_r$ and arrive at

\[
L_{ij} = \sum_{r} L^{r}_{ij}. \qquad (10)
\]

We introduce a few shorthand notations. Let $S_r$ be a $N \times N_r$ matrix whose elements are $\mathrm{sim}_r(i,k)$, the frame-based similarity between the $N$ frames in $\mathcal{Y}$ and the $N_r$ frames in $\mathcal{Y}_r$. The kernel matrix $L$ is thus

\[
L = \sum_{r} S_r L_r S_r^{\mathsf{T}}. \qquad (11)
\]

Note that $L$ is for the test video with $N$ frames — there is no need for all the videos to have the same length.
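Putting Steps 1-3 together, the following sketch (ours, with hypothetical argument names) assembles eq. (11): each S_r holds frame similarities between the test video and training video r, L_r is the idealized kernel of eq. (6), and the transferred kernel is the sum of S_r L_r S_r^T.

```python
import numpy as np

def transfer_kernel(test_feats, train_videos, sim, alpha=2.0):
    """Build L for a test video from annotated training videos, cf. eq. (11).

    test_feats:   (N, d) array of per-frame features of the test video.
    train_videos: list of (train_feats (N_r, d), summary_indices) pairs.
    sim:          a frame-level similarity function, e.g. sim2 from eq. (4).
    """
    N = test_feats.shape[0]
    L = np.zeros((N, N))
    for train_feats, summary in train_videos:
        Nr = train_feats.shape[0]
        # S_r: N x N_r frame-based similarities (Step 1)
        S_r = np.array([[sim(test_feats[i], train_feats[k])
                         for k in range(Nr)] for i in range(N)])
        # L_r: idealized summarization kernel of the training video (Step 2), eq. (6)
        delta = np.zeros(Nr)
        delta[list(summary)] = 1.0
        L_r = alpha * np.diag(delta)
        # contribution of training video r (Step 3), eq. (11)
        L += S_r @ L_r @ S_r.T
    return L
```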

Step 4: Extracting the summary. Once we have computed $L$ for the new video, we use the MAP inference of eq. (2) to extract the summary as the most probable subset of frames.

3.3. Learning

Our approach requires adjusting parameters such as $\alpha = \{\alpha_1, \alpha_2, \cdots, \alpha_R\}$ for the ideal summarization kernels and/or $\Omega$ for computing frame-based visual similarity, eq. (4). We use maximum likelihood estimation to estimate those parameters. Specifically, for each video in the training dataset, we pretend it is a new video and formulate a kernel matrix

\[
L_q = \sum_{r} S_{qr} L_r S_{qr}^{\mathsf{T}}, \quad \forall\, q = 1, 2, \cdots, R. \qquad (12)
\]

We optimize the parameters such that the ground-truth summarization $y_q$ attains the highest probability under $L_q$,

\[
\alpha^* = \arg\max_{\alpha} \sum_{q=1}^{R} \log P(y_q; L_q). \qquad (13)
\]

We can formulate a similar criterion to learn the $\Omega$ parameter for $\mathrm{sim}_3(\cdot, \cdot)$. We carry out the optimization by gradient descent. In our experiments, we set $\sigma$ for $\mathrm{sim}_2$ to be 1, with features normalized to have unit norm. Additional details are in the Suppl. and omitted here for brevity.
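To make the training objective concrete, the sketch below (our own, not the released code) evaluates the log-likelihood of eq. (13) given precomputed similarity matrices; alpha could then be fitted with any gradient-based optimizer using the analytic gradient of eq. (17) in the Suppl., or with finite differences when R is small. Whether the sum over r includes r = q is left exactly as written in eq. (12).

```python
import numpy as np

def log_likelihood(alphas, S, D, summaries):
    """sum_q log P(y_q; L_q) with L_q = sum_r alpha_r S_qr D_r S_qr^T, cf. eqs. (12)-(13).

    S[q][r]:      similarity matrix S_qr between training videos q and r (N_q x N_r).
    D[r]:         diagonal 0/1 indicator matrix of the summary frames of video r.
    summaries[q]: index list y_q of the ground-truth summary of video q.
    """
    total = 0.0
    R = len(summaries)
    for q in range(R):
        Nq = S[q][0].shape[0]
        L_q = sum(alphas[r] * S[q][r] @ D[r] @ S[q][r].T for r in range(R))
        y = summaries[q]
        _, logdet_y = np.linalg.slogdet(L_q[np.ix_(y, y)])
        _, logdet_norm = np.linalg.slogdet(L_q + np.eye(Nq))
        total += logdet_y - logdet_norm
    return total
```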

3.4. Extensions

Category-specific summary transfer. Video datasets labeled with semantically consistent categories have been emerging [39, 41]. We view categories as valuable prior information that can be exploited by our nonparametric learning algorithm. Intuitively, videos from the same category (activity type, genre, etc.) are likely to be similar in part, not only in visual appearance but also in high-level semantic cues (such as how key events are temporally organized), resulting in similar summary structures. We extend our method to take advantage of such optional side information in two ways:

• hard transfer. We compare the new video from a category $c$ only to the training videos from the same category $c$. Mathematically, for each video category, we learn a category-specific set of $\alpha^{(c)} = \{\alpha^{(c)}_1, \alpha^{(c)}_2, \cdots, \alpha^{(c)}_R\}$ such that $\alpha^{(c)}_r > 0$ only when the training video $r$ belongs to category $c$.

• soft transfer. We relax the requirement in hard transfer such that $\alpha^{(c)}_r > 0$ even if the $r$-th training video is not from category $c$. Note that while we utilize structural information from all training videos, the way we use them still depends on the test video's category.

Subshot-based summary transfer. Videos can also be summarized at the level of subshots. As opposed to selecting keyframes, subshots contain short but contiguous frames, giving a glimpse of a key event. We next extend our subset selection algorithm to select a subset of subshots.

To this end, in our conceptual diagram in Fig. 1, we replace computing frame-level similarity with subshot-level similarity, where we compare subshots between the training videos and the new video. We explore two possible ways to compute the frame-set to frame-set similarity:

• Similarity between averaged features. We represent the subshots using the averaged frame-level feature vectors within each subshot. We then compute the similarity using the previously defined $\mathrm{sim}(\cdot, \cdot)$.

• Maximum similarity. We compute pairwise similarity between frames within the subshots and select the maximum value as the similarity between the subshots.

Both approaches reduce the reliance on frame-based similarity defined in the global frame-based descriptors of visual appearance, loosening the required visual alignment for discovering a good match — especially with the latter max operator, in principle, since it can ignore many unmatchable frames in favor of a single strong link within the subshots. Moreover, the first approach can significantly reduce the computational cost of nonparametric learning, as the amount of pairwise-similarity computation now depends on the number of subshots, which is substantially smaller than the number of frames.
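A short sketch of the two subshot-level measures (our own illustration; each subshot is an array of per-frame features, and `sim` stands for any frame-level measure from eq. (4)):

```python
import numpy as np

def subshot_sim_mean(shot_a, shot_b, sim):
    """Similarity between averaged features: compare per-subshot mean feature vectors."""
    return sim(shot_a.mean(axis=0), shot_b.mean(axis=0))

def subshot_sim_max(shot_a, shot_b, sim):
    """Maximum similarity: the best frame-to-frame match between the two subshots."""
    return max(sim(fa, fb) for fa in shot_a for fb in shot_b)
```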

3.5. Implementation and computation cost

Computing $S_r$ in eq. (11) is an $O(N \times \sum_r N_r)$ operation. For long videos, several approaches will reduce the cost significantly. First, it is a standard procedure to down-sample the video (by a factor of 5-30) to reduce the number of frames for keyframe-based summarization. Our subshot-based summarization can also reduce the computation cost, cf. section 3.4. Generic techniques should also help — $\mathrm{sim}(\cdot, \cdot)$ computes various forms of distances among visual feature vectors. Thus, many fast search techniques apply, such as locality-sensitive hashing or tree structures for nearest-neighbor searches.

4. Experiments

We validate our approach on five benchmark datasets. It outperforms competing methods in many settings. We also analyze its strengths and weaknesses.

4.1. Setup

Data. For keyframe-based summarization, we experiment on three video datasets: the Open Video Project (OVP) [1, 4], the YouTube dataset [4], and the Kodak consumer video dataset [32]. All three datasets were used in [10], and we follow the procedure described there to preprocess the data and to generate training ground-truths from multiple human-created summaries. For the YouTube dataset, in the following, we report results on 31 videos after discarding 8 videos that are neither "Sports" nor "News", such that we can investigate category-specific video summarization (cf. sec. 3.4). In the Suppl., we report results on the original dataset.

For subshot-based summarization (cf. sec. 3.4), we experiment on three video datasets: the portion of MED with 160 annotated summaries [39], SumMe [11], and YouTube. Videos in MED are pre-segmented into subshots with the Kernel Temporal Segmentation (KTS) method [39], and we use those given subshots. For SumMe and YouTube, we apply KTS to generate our own sets of subshots. MED has 10 well-defined video categories, allowing us to experiment with category-specific video summarization on it too. SumMe does not have semantic category meta-data. Instead, its video contents have a large variation in visual appearance and can be classified according to shooting style: still camera, egocentric, or moving. Table 1 summarizes key characteristics of those datasets, with details in the Suppl.

Features. For OVP/YouTube/Kodak/SumMe, we encode each frame with an $\ell_2$-normalized 8192-dimensional Fisher vector [38], computed from SIFT features. For OVP/YouTube/Kodak, we also use color histograms. We also experimented with features from a pre-trained convolutional network (CNN); details are in the Suppl. For MED, we use the provided 16512-dimensional Fisher vectors.

Evaluation metrics. As in [10, 11, 12] and other prior work, we evaluate automatic summarization results (A) by comparing them to the human-created summaries (B) and reporting the standard metrics of F-score (F), precision (P), and recall (R) — their definitions are given in the Suppl.

For OVP/YouTube/Kodak, we follow [10] and utilize the VSUMM package [4] for finding matching pairs of frames. For SumMe, we follow the procedure in [11, 12]. More details are in the Suppl.
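Although the exact matching procedure is dataset-specific (VSUMM [4] for OVP/YouTube/Kodak, the protocol of [11, 12] for SumMe) and detailed only in the Suppl., the reported numbers follow the usual definitions; a hedged sketch given the number of matched pairs:

```python
def precision_recall_f(num_matched, len_auto, len_human):
    """Standard P/R/F between an automatic summary A and a human summary B.

    num_matched: number of matched frame (or subshot) pairs between A and B.
    len_auto:    number of frames/subshots in A.
    len_human:   number of frames/subshots in B.
    """
    precision = num_matched / len_auto if len_auto else 0.0
    recall = num_matched / len_human if len_human else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f
```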

Implementation details. For each dataset, we randomly choose 80% of the videos for training and use the remaining 20% for testing, repeating for 5 or 100 rounds so that we can calculate averaged performance and standard errors. To report existing methods' results, we use prior published numbers when possible. We also implemented the VSUMM approach [4] and obtained code from the authors for seqDPP [10] in order to apply them to several datasets. We follow the practices in [10] so that we can summarize videos sequentially. For MED, we implement KVS [39] ourselves.

4.2. Main Results

We summarize our key findings in this section. For more details, please refer to the Suppl.

In Table 2, we compare our approach to both supervised and unsupervised methods for video summarization. We report published results in the table as well as results from our own implementation of some methods. Only the best variants of all methods are quoted and presented; others are deferred to the Suppl.

On all but one of the five datasets (OVP), our nonparametric learning method achieves the best results. In general, the supervised methods achieve better results than the unsupervised ones. Note that even for datasets with a variety of videos that are not closely visually similar (such as SumMe), our approach attains the best result — this indicates that our method of transferring summary structures is effective, able to build on top of even crude frame similarities.


Table 1: Key characteristics of datasets used in our empirical studies. Most videos in these datasets have a duration from 1 to 5 minutes.

| Dataset | # of videos | # of categories | # of training videos | # of test videos | Type of summarization | Evaluation metric: F-score in matching |
|---|---|---|---|---|---|---|
| Kodak | 18 | - | 14 | 4 | keyframe | selected frames |
| OVP | 50 | - | 40 | 10 | keyframe | selected frames |
| YouTube | 31 | 2 | 31 | 8 | keyframe; subshot | selected frames; frames in selected subshots |
| SumMe | 25 | - | 20 | 5 | subshot | frames in selected subshots |
| MED | 160 | 10 | 128 | 32 | subshot | matching selected subshots |

Table 2: Performance (F-score) of various video summarization methods. Numbers followed by citations are from published results. Others are from our own implementation. Dashes denote unavailable/inapplicable dataset-method combinations. The first six methods (VSUMM1 through Video MMR) are unsupervised; the last four (SumMe, Submodular, seqDPP, Ours) are supervised.

| Dataset | VSUMM1 [4] | VSUMM2 [4] | DT [35] | STIMO [6] | KVS [39] | Video MMR [26] | SumMe [11] | Submodular [12] | seqDPP [10] | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| Kodak | 69.5 | 67.6 | - | - | - | - | - | - | 78.9 | 82.3 |
| OVP | 70.3 | 69.5 | 57.6 | 63.4 | - | - | - | - | 77.7 | 76.5 |
| YouTube | 58.7 | 59.9 | - | - | - | - | - | - | 60.8 | 61.8 |
| MED | 28.9 | 28.8 | - | - | 20.6 | - | - | - | - | 30.7 |
| SumMe | 32.8 | 33.7 | - | - | - | 26.6 | 39.3 | 39.7 | - | 40.9 |

4.3. Detailed analysis

Advantage of nonparametric learning. Nonparametric learning enjoys the property of generalizing locally. That is, as long as a test video has enough correctly annotated training videos in its neighborhood, the summary structures of those videos will transfer. A parametric learning method, on the other hand, needs to learn both the locations of those local regions and how to generalize within them. If there are not enough annotated training videos covering the whole range of the data space, it could be difficult for a parametric learning method to learn well.

We design a simple example to illustrate this point. As it is difficult to assess "similarity" to derive nearest neighbors for video, we use a video's category to delineate those that are "semantically near". Specifically, we split YouTube's 31 videos into two piles, according to their categories "Sports" or "News". We then construct seqDPP, a parametric learning model [10], using all 31 videos, as well as using "Sports" or "News" videos only, to summarize test videos from either category. We then compare with our method in the same setting. Table 3 displays the results.

The results on the "News" category convincingly suggest that a nonparametric approach like ours can leverage the semantically close videos to outperform the parametric approach with the same amount of annotated data — or even more data. (Note that the two approaches perform nearly identically on the "Sports" category.)

Table 3: Advantage of nonparametric summary transfer.

| Type of test video | seqDPP (all) | seqDPP (same as test) | Ours (all) | Ours (same as test) |
|---|---|---|---|---|
| Sports | 52.8 | 54.5 | 53.5 | 54.4 |
| News | 67.9 | 67.7 | 66.9 | 69.1 |

Advantage of exploiting category prior. Table 3 already alludes to the fact that exploiting category side-information can improve summarization (cf. contrasting the column "same as test" to "all"). Now we investigate this advantage in more detail. Table 4 shows how our nonparametric summary transfer can exploit video category information, using the method explained in section 3.4.

Particularly interesting are our results on the SumMe dataset, which itself does not provide semantically meaningful categories. Instead, we generate two "fake" categories for it. We first collapse the 10 video categories in the dataset TVSum$^2$ [41] into two super-categories (details in the Suppl.) — these two super-categories are each internally semantically coherent, though they do not have obvious visual similarity to the videos in SumMe.

We then build a binary classifier trained on TVSum videos, use it to classify the videos in SumMe as "super-category I" or "super-category II", and then proceed as if these were ground-truth categories, as in MED and YouTube.

$^2$We choose this dataset as it has raw videos from which we can extract features and a larger number of labeled videos with which to build a category classifier.


Table 4: Video category information helps summarization.

| Setting | YouTube | MED | SumMe |
|---|---|---|---|
| w/o category | 60.0 | 28.9 | 39.2 |
| category-specific hard | 61.5 | 30.4 | 40.9 |
| category-specific soft | 60.6 | 30.7 | 40.2 |

Table 5: Subshot-based summarization on YouTube.

| Category-specific | Frame-based | Subshot-based (Mean-Feature) | Subshot-based (Max-similarity) |
|---|---|---|---|
| No | 60.0 | 60.7 | 60.9 |
| Yes | 61.5 | 61.6 | 61.8 |

category I” and “super-category II” and then proceedas if they are ground-truth categories, as in MED andYouTube. Despite this independently developed di-chotomy, the results on SumMe improve over using allvideo data together.

Subshot-based summarization. In section 3.4, we discuss an extension to summarize video at the level of selecting subshots. This extension not only reduces computational cost (as the number of subshots is significantly smaller than that of frames), but also provides additional means of measuring similarity across videos beyond frame-level visual similarity inferred from global frame-based descriptors. Next we examine how such flexibility can ultimately improve keyframe-based summarization. Concretely, we first perform subshot summarization, then pick the middle frame in each selected subshot as the output keyframes. This allows us to directly compare to keyframe-based summarization using the same F-score metric.

Table 5 shows the results. Subshot-based summarization clearly improves over frame-based summarization — this is very likely due to the more robust similarity measures now computed at the subshot level. The improvement is more pronounced when a category prior is not used. One possible explanation is that measuring similarity on videos from the same categories is easier and more robust, whereas across categories it is noisier. Thus, when a category prior is not present, the subshot-based similarity measure benefits summarization more.

Other detailed analysis in the Suppl. We summarize other analysis results as follows. We show how to improve frame-level similarity by learning better metrics. We also show that deep features, though powerful for visual category recognition, are not particularly advantageous compared to shallow features. We also show how a category prior can still be exploited even when we do not know the true category of the test videos.

Figure 2: A failure case by our approach. (Panels, top to bottom: Ours (F = 60); Summary of the nearest training video; Ground-truth; seqDPP (F = 62).) Our summary misses the last two frames from the ground-truth (red-boxed) as the test video (a nature scene) transfers from the nearest video with a semantically different category (beach activity). See text.

4.4. Qualitative analysis

Fig. 2 shows a failure case of our method. Here the test video depicts a natural scene, while its closest training video depicts beach activities. There is some visual similarity (e.g., in the swath of sky). However, semantically, these two videos do not seem to be relevant, and it is likely difficult for the transfer to occur. In particular, our results miss the last two frames, where there is a lot of grass. This suggests one weakness in our approach: the formulation of our summarization kernel for the test video does not directly consider the relationship between its own frames — instead, they interact through the training videos. Thus, one possible direction to avoid unreliable neighbors in the training videos is to rely on the visual properties of the test video itself. This suggests future work on a hybrid approach with parametric and nonparametric aspects that complement each other.

Please see the Suppl. for more qualitative analysis and example output summaries.

5. Conclusion

We propose a novel supervised learning technique for video summarization. The main idea is to learn nonparametrically to transfer summary structures from training videos to test ones. We also show how to exploit side (semantic) information such as video categories and propose an extension for subshot-based summarization. Our method achieves promising results on several benchmark datasets, compared to an array of nine existing techniques.

Acknowledgements

KG is partially supported by NSF IIS-1514118. Others are partially supported by USC Annenberg and Viterbi Graduate Fellowships, NSF IIS-1451412, 1513966, and CCF-1139148.

References

[1] Open video project. http://www.open-video.org/.
[2] YouTube statistics: https://www.youtube.com/yt/press/statistics.html.
[3] W.-L. Chao, B. Gong, K. Grauman, and F. Sha. Large-margin determinantal point processes. In UAI, 2015.
[4] S. E. F. de Avila, A. P. B. Lopes, A. da Luz, and A. de Albuquerque Araujo. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 32(1):56-68, 2011.
[5] S. Feng, Z. Lei, D. Yi, and S. Z. Li. Online content-aware video condensation. In CVPR, 2012.
[6] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini. STIMO: Still and moving video storyboard for the web scenario. Multimedia Tools and Applications, 46(1):47-69, 2010.
[7] J. Gillenwater, A. Kulesza, and B. Taskar. Discovering diverse and salient threads in document collections. In EMNLP, 2012.
[8] J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal MAP inference for determinantal point processes. In NIPS, 2012.
[9] D. B. Goldman, B. Curless, D. Salesin, and S. M. Seitz. Schematic storyboarding for video visualization and editing. ACM Transactions on Graphics, 25(3):862-871, 2006.
[10] B. Gong, W.-L. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In NIPS, 2014.
[11] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating summaries from user videos. In ECCV, 2014.
[12] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In CVPR, 2015.
[13] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics, 26(3):4, 2007.
[14] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[15] R. Hong, J. Tang, H.-K. Tan, S. Yan, C. Ngo, and T.-S. Chua. Event driven summarization for web videos. In SIGMM Workshop, 2009.
[16] H.-W. Kang, Y. Matsushita, X. Tang, and X.-Q. Chen. Space-time video montage. In CVPR, 2006.
[17] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan. Large-scale video summarization using web-image priors. In CVPR, 2013.
[18] G. Kim, L. Sigal, and E. P. Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR, 2014.
[19] J. Kim and K. Grauman. Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates. In CVPR, 2009.
[20] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] A. Kulesza and B. Taskar. Learning determinantal point processes. In UAI, 2011.
[23] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3), 2012.
[24] R. Laganiere, R. Bacco, A. Hocevar, P. Lambert, G. Pais, and B. E. Ionescu. Video summarization from spatio-temporal features. In ACM TRECVid Video Summarization Workshop, 2008.
[25] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012.
[26] Y. Li and B. Merialdo. Multi-video summarization based on video-MMR. In WIAMIS Workshop, 2010.
[27] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2368-2382, 2011.
[28] D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2178-2190, 2010.
[29] T. Liu and J. R. Kender. Optimization algorithms for the selection of key frame sequences of variable length. In ECCV, 2002.
[30] Y. Lu, X. Bai, L. Shapiro, and J. Wang. Coherent parametric contours for interactive video object segmentation. In CVPR, 2016.
[31] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In CVPR, 2013.
[32] J. Luo, C. Papin, and K. Costello. Towards extracting semantically meaningful key frames from personal video clips: from humans to computers. IEEE Transactions on Circuits and Systems for Video Technology, 19(2):289-301, 2009.
[33] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li. A user attention model for video summarization. In ACM Multimedia, 2002.
[34] A. G. Money and H. Agius. Video summarisation: A conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation, 19(2):121-143, 2008.
[35] P. Mundur, Y. Rao, and Y. Yesha. Keyframe-based video summarization using Delaunay clustering. International Journal on Digital Libraries, 6(2):219-232, 2006.
[36] J. Nam and A. H. Tewfik. Event-driven video abstraction and visualization. Multimedia Tools and Applications, 16(1-2):55-77, 2002.
[37] C.-W. Ngo, Y.-F. Ma, and H. Zhang. Automatic video summarization by graph modeling. In ICCV, 2003.
[38] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[39] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In ECCV, 2014.
[40] Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg. Webcam synopsis: Peeking around the world. In ICCV, 2007.
[41] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. TVSum: Summarizing web videos using titles. In CVPR, 2015.
[42] M. Sun, A. Farhadi, B. Taskar, and S. Seitz. Salient montages from unconstrained videos. In ECCV, 2014.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[44] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008.
[45] B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(1):3, 2007.
[46] W. Wolf. Key frame selection by motion analysis. In ICASSP, 1996.
[47] Z. Wu, Y. Fu, Y.-G. Jiang, and L. Sigal. Harnessing object and scene semantics for large-scale video understanding. In CVPR, 2016.
[48] J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg, and V. Singh. Gaze-enabled egocentric video summarization via constrained submodular maximization. In CVPR, 2015.
[49] Y. Yang, G. Sundaramoorthi, and S. Soatto. Self-occlusions and disocclusions in causal video object segmentation. In ICCV, 2015.
[50] H. J. Zhang, J. Wu, D. Zhong, and S. W. Smoliar. An integrated system for content-based video retrieval and browsing. Pattern Recognition, 30(4):643-658, 1997.
[51] B. Zhao and E. P. Xing. Quasi real-time summarization for consumer videos. In CVPR, 2014.


Supplementary Material

In this Supplementary Material, we give details omitted in the main text: learning parameters in section A (section 3.3 in the main text), datasets, features and evaluation metrics for our experimental studies in sections B and C (section 4.1 in the main text), analysis and results in section D (section 4.2 in the main text), and further discussions in section E.

A. Learning

A.1. Maximum Likelihood Estimation

We use gradient descent to maximize the log-likelihood $\sum_{q=1}^{R} \log P(y_q; L_q)$, defined in (13) of the main text, w.r.t. $\alpha$ and $\Omega$:

\[
(\alpha^*, \Omega^*) = \arg\max_{\alpha, \Omega} \sum_{q=1}^{R} \log P(y_q; L_q). \qquad (14)
\]

For brevity, we ignore the subscript $q$ in the following. The corresponding partial derivatives w.r.t. $\alpha$ and $\Omega$ are shown below, according to the chain rule:

\[
\frac{\partial \log P(y; L)}{\partial \alpha_r}
= \sum_{m,n} \sum_{k,l}
\frac{\partial \log P(y; L)}{\partial (L)_{mn}}
\frac{\partial (L)_{mn}}{\partial L_{r,kl}}
\frac{\partial L_{r,kl}}{\partial \alpha_r} \qquad (15)
\]

\[
\frac{\partial \log P(y; L)}{\partial \Omega}
= \sum_{r=1}^{R} \sum_{m,n} \sum_{k,l}
\frac{\partial \log P(y; L)}{\partial (L)_{mn}}
\frac{\partial (L)_{mn}}{\partial S_{r,kl}}
\frac{\partial S_{r,kl}}{\partial \Omega} \qquad (16)
\]

To ease the computation in what follows, we define $M$ as a matrix of the same size as $L$, with $M_y = (L_y)^{-1}$ and all other entries of $M$ equal to zero.

A.2. Gradients with respect to the scale parameters

\[
\frac{\partial \log P(y; L)}{\partial \alpha_r}
= \sum_{m,n} \sum_{k,l}
\frac{\partial \log P(y; L)}{\partial (L)_{mn}}
\frac{\partial (L)_{mn}}{\partial L_{r,kl}}
\frac{\partial L_{r,kl}}{\partial \alpha_r}
= \mathbf{1}^{\mathsf{T}} \Big( \big( S_r^{\mathsf{T}} \big(M - (L + I)^{-1}\big) S_r \big) \odot I_r \Big) \mathbf{1}, \qquad (17)
\]

where $\odot$ is the element-wise product, $I_r$ is a diagonal matrix (of the same size as $L_r$) with the entries indexed by $y_r$ equal to one and all others zero, and $\mathbf{1}$ is the all-ones vector of suitable size.
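As a numerical sanity check on eq. (17), the sketch below (our own, on random toy data) compares the analytic gradient with a finite-difference estimate; D_r denotes the 0/1 diagonal indicator of the training summary, so that L_r = alpha_r D_r:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nr, R = 6, 5, 3
S = [rng.random((N, Nr)) for _ in range(R)]     # S_r: test-vs-training similarities
D = []
for r in range(R):                              # made-up summary indicators for illustration
    d = np.zeros(Nr)
    d[r::2] = 1.0
    D.append(np.diag(d))
alpha = 2.0 * np.ones(R)
y = [0, 2, 4]                                   # a subset of the test video's frames

def build_L(a):
    return sum(a_r * S_r @ D_r @ S_r.T for a_r, S_r, D_r in zip(a, S, D))

def log_prob(a):
    L = build_L(a)
    _, ld_y = np.linalg.slogdet(L[np.ix_(y, y)])
    _, ld_norm = np.linalg.slogdet(L + np.eye(N))
    return ld_y - ld_norm

# Analytic gradient, eq. (17): 1^T [ (S_r^T (M - (L+I)^{-1}) S_r) o I_r ] 1
L = build_L(alpha)
M = np.zeros_like(L)
M[np.ix_(y, y)] = np.linalg.inv(L[np.ix_(y, y)])
A = M - np.linalg.inv(L + np.eye(N))
grad = np.array([np.sum((S_r.T @ A @ S_r) * D_r) for S_r, D_r in zip(S, D)])

# Finite-difference check
eps = 1e-6
fd = np.array([(log_prob(alpha + eps * np.eye(R)[r]) - log_prob(alpha)) / eps
               for r in range(R)])
print(np.allclose(grad, fd, rtol=1e-3, atol=1e-4))   # expected: True
```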

A.3. Gradients with respect to the transformation matrix

\[
\frac{\partial \log P(y; L)}{\partial \Omega}
= \sum_{r=1}^{R} \sum_{m,n} \sum_{k,l}
\frac{\partial \log P(y; L)}{\partial (L)_{mn}}
\frac{\partial (L)_{mn}}{\partial S_{r,kl}}
\frac{\partial S_{r,kl}}{\partial \Omega}
= \sum_{r=1}^{R} 4\Omega \Big( \Phi C_r \Phi_r^{\mathsf{T}} + \Phi_r C_r^{\mathsf{T}} \Phi^{\mathsf{T}}
- \big( \Phi C_r^{(1)} \Phi^{\mathsf{T}} + \Phi_r C_r^{(2)} \Phi_r^{\mathsf{T}} \big) \Big), \qquad (18)
\]

where $C_r$, $C_r^{(1)}$, and $C_r^{(2)}$ are defined as:

\[
C_r = \big( (M - (L + I)^{-1}) S_r L_r \big) \odot S_r, \qquad
C_r^{(1)} = \mathrm{diag}(C_r \mathbf{1}), \qquad
C_r^{(2)} = \mathrm{diag}(\mathbf{1}^{\mathsf{T}} C_r). \qquad (19)
\]

$\Phi_r$ and $\Phi$ are the column-wise concatenations (matrices) of $\phi_r$ and $\phi$, respectively.

A.4. Extension for sequential modeling

Prior work on seqDPP [10] shows that the vanilla DPP is inadequate in capturing the sequential nature of video data. In particular, a DPP models a bag of items such that permutation will not affect selected subsets. Clearly, video frames cannot be randomly permuted without losing the semantic meaning of the video. The authors propose the sequential DPP (seqDPP), which sequentially extracts summaries from video sequences.

Extending our approach to this type of sequential modeling is straightforward due to its non-parametric nature. Briefly, we segment each video in the training dataset into smaller chunks and treat each chunk as a separate "training" video containing a subset of the original summary. We follow the same procedure as described above to formulate the base summarization kernels for those shorter videos and their associated summaries. Likewise, we learn the parameters $\alpha$ and $\Omega$ for each segment.

During testing time, when we need to summarize a new video $\mathcal{Y}$, we segment the video $\mathcal{Y}$ in temporal order: $\mathcal{Y} = \mathcal{Y}_1 \cup \mathcal{Y}_2 \cup \cdots \cup \mathcal{Y}_T$. We then perform the sequential summarization procedure of the seqDPP method. Specifically, at time $t$, we formulate the ground set as $U_t = \mathcal{Y}_t \cup y_{t-1}$, i.e., the union of the video segment $t$ and the subset selected in the immediate past. For the ground set $U_t$, we calculate its kernel matrix using the (segmented) training videos and their summaries. We then perform the extraction. The recursive extraction process is exactly the same as in [10] and we hence omit it here.

In our experiments, we use sequential modeling for the Kodak, OVP, YouTube, and SumMe datasets. For the MED dataset and the subshot-based experiments on the YouTube dataset, we directly apply our model to each video without sequential modeling, as the number of subshots per video is small. We use KTS [39] to divide a video into segments.
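A high-level sketch of this sequential extraction loop (our own summary of the procedure described above; `build_transfer_kernel` and `dpp_map` are hypothetical helpers standing for the kernel construction of eq. (11) restricted to the current ground set and an approximate MAP routine for eq. (2), respectively):

```python
def sequential_summarize(segments, build_transfer_kernel, dpp_map):
    """Sequentially summarize a test video split into temporal segments Y_1, ..., Y_T.

    segments: list of lists of frame indices, one list per temporal segment.
    build_transfer_kernel(ground_set): kernel L over the ground set, via eq. (11).
    dpp_map(L, ground_set): (approximate) MAP subset of the ground set, eq. (2).
    """
    prev_selected = []          # y_{t-1}
    summary = []
    for segment in segments:
        ground_set = list(prev_selected) + list(segment)   # U_t = Y_t union y_{t-1}
        L = build_transfer_kernel(ground_set)
        selected = dpp_map(L, ground_set)
        seg_set = set(segment)
        new_frames = [f for f in selected if f in seg_set]  # keep new picks from Y_t only
        summary.extend(new_frames)
        prev_selected = new_frames
    return summary
```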

B. Datasets

We validate our approach on five benchmark datasets. In this section we give detailed descriptions of each of them.

B.1. Kodak, OVP and YouTube dataset

We perform keyframe-based video summarization on three video datasets: the Open Video Project (OVP) [1, 4], the YouTube dataset [4], and the Kodak consumer video dataset [32]. They contain 50, 39, and 18 videos, respectively. In the main text, to investigate the effectiveness of the category-specific prior, we use Category Sports (16 videos) and News (15 videos), i.e., 31 out of 39 videos. The OVP and YouTube datasets have five human-created frame-level summaries per video, while Kodak has one per video. Thus, for the first two datasets, we follow the procedure described in [10] to create an oracle summary per video. The oracle summaries are then used to optimize the model parameters of the various methods.

As suggested in [4], we pre-process the video frames as follows. We uniformly sample one frame per second for OVP and YouTube, and two frames per second for Kodak (as its videos are shorter), to create the ground set Y for each video. We then prune away obviously uninformative frames, such as transition frames close to shot boundaries and near-monotone frames. On average, the resulting ground set per video contains 84, 128, and 50 frames for OVP, YouTube, and Kodak, respectively. We replace each frame in the human-annotated summary with the temporally closest frame from the ground set, if it is not already contained in the ground set.

B.2. SumMe dataset

The SumMe dataset [11, 12] consists of 25 videos with an average length of 2m40s. Similar to [39], each video is cut and summarized by 15 to 18 people, and the average length of the ground-truth (shot-based) summary is 13.1% of the original video. We thus perform subshot-based summarization on this dataset.

The video content in the SumMe dataset is heterogeneous, and the dataset does not come with pre-defined categories. However, some of the videos are related to various degrees. For example, the videos titled 'Kids playing in leaves' and 'Playing on water slide' are both about children playing. Thus, it is interesting to investigate whether a category prior can help improve summarization on this dataset.

Instead of manually labeling each video, we propose two synthetic categories and split the videos into two partitions. We first collapse the 10 video categories in the TVSum dataset3 [41] into two super-categories, denoted as Super-category I and II, respectively. Super-category I includes activities such as attempting bike tricks, flash mob gathering, park tour, and parade, whose raw videos mostly contain crowds; Super-category II includes the remaining activity types, which involve much less people interaction. The videos within each super-category are semantically similar to one another, though they have no obvious visual similarity to the videos in SumMe. We then train a binary classifier on the TVSum videos, use it to classify the SumMe videos into Super-category I and II, and proceed as if these were ground-truth categories, as in MED and YouTube.

Since the shot boundaries given by users vary even for the same video, we construct an oracle summary of each video for training as follows. We first score each frame by the number of user summaries containing it. We then obtain our own shot boundaries based on KTS [39] and compute the shot-level scores by averaging the frame scores within each shot. Finally, we rank the shots and select the top ones, which together cover around 15% of the original video, as the oracle summary for that video.
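A short sketch of this oracle construction, under simplifying assumptions (frame-level user summaries given as binary arrays, KTS boundaries given as a list of segment start indices; the function name is ours):

import numpy as np

def summe_oracle(user_summaries, kts_boundaries, n_frames, budget=0.15):
    """Build an oracle shot-based summary from multiple user summaries.

    user_summaries : list of binary arrays of length n_frames (1 = frame is in that user's summary)
    kts_boundaries : sorted list of segment start indices produced by KTS, beginning with 0
    """
    # Score each frame by how many user summaries contain it.
    frame_scores = np.sum(np.stack(user_summaries), axis=0)

    # Average the frame scores within each KTS shot.
    bounds = list(kts_boundaries) + [n_frames]
    shots = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    shot_scores = [frame_scores[s:e].mean() for s, e in shots]

    # Greedily take the highest-scoring shots until ~15% of the video is covered.
    oracle, total = [], 0
    for idx in np.argsort(shot_scores)[::-1]:
        s, e = shots[idx]
        oracle.append(idx)
        total += e - s
        if total >= budget * n_frames:
            break
    return sorted(oracle)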

B.3. MED dataset

The MED dataset, developed in [39], is a subset of the TRECVid 2011 MED dataset. The dataset has 12,249 videos: 2,389 videos in the following 10 categories (namely Birthday party, Changing a vehicle tire, Flash mob gathering, Getting a vehicle unstuck, Grooming an animal, Making a sandwich, Parade, Parkour, Repairing an appliance, and Working on a sewing project) and 9,860 belonging to none of them (labeled as 'null'). 160 videos from the above 2,389 are annotated with summaries. For each video in the dataset, an annotator is asked to cut and score all the shots in the video (from 0 to 3).

3 We choose TVSum because it provides raw videos from which we can extract features and a larger number of labeled videos for building a category classifier.


We thus perform subshot-based summarization on this dataset.

There are several constraints in using this dataset: it does not provide the original videos, only the boundaries of previously segmented shots and pre-computed features (on average 27 shots per video, with their Fisher vectors). Those shot boundaries often differ from the shots determined by the human annotators.

To use this dataset for subshot-based summarization, we first need to combine all the human annotators' shot scores into oracle summaries for training. To this end, we map each user's annotations to the pre-determined shots: in particular, each frame in a pre-determined shot inherits the importance scores of the human-annotated subshots that contain it. We then compute shot-level importance scores by averaging the frame importances within each pre-determined subshot. We select the shots with the top K% importance scores as the ground truth, which we then use for training. In this paper, we test our model under different summary lengths, i.e., K = 15 and 30.
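The sketch below illustrates this mapping for a single annotator, under assumed input formats (user shots as (start, end, score) triples, pre-determined shots as (start, end) pairs) and with the top K% interpreted as the top K% of the pre-determined shots; scores from multiple annotators would be combined analogously.

import numpy as np

def med_ground_truth(user_shots, predetermined_shots, n_frames, k=15):
    """Map an annotator's shot scores onto pre-determined shots and keep the top k%.

    user_shots          : list of (start, end, score) triples from a human annotator
    predetermined_shots : list of (start, end) boundaries provided by the dataset
    """
    # Each frame inherits the score of the user shot that contains it.
    frame_scores = np.zeros(n_frames)
    for start, end, score in user_shots:
        frame_scores[start:end] = score

    # Shot-level importance = mean frame score within each pre-determined shot.
    shot_scores = np.array([frame_scores[s:e].mean() for s, e in predetermined_shots])

    # Keep the shots whose importance is in the top k%.
    n_keep = max(1, int(round(len(predetermined_shots) * k / 100.0)))
    return sorted(np.argsort(shot_scores)[::-1][:n_keep].tolist())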

B.4. Features

For the Kodak, OVP, and YouTube datasets, we extract Fisher vectors, color histograms, and CNN features (based on GoogLeNet [43]). In our experiments, we use the Fisher vectors and color histograms for these three datasets, except in the comparison in section D.4. For the MED dataset, as the authors did not release the raw videos, we use the Fisher vectors provided with the dataset. For the SumMe dataset, we use the output of layer 6 of AlexNet [21] as features.

C. Evaluation protocols

In all our experiments, we evaluate automatic summarization results (A) by comparing them to the human-created summaries (B) and reporting the F-score (F), precision (P), and recall (R), as follows:

\[
\text{Precision} = \frac{\#\,\text{matched pairs}}{\#\,\text{frames (shots) in } A} \times 100\%, \qquad (20)
\]
\[
\text{Recall} = \frac{\#\,\text{matched pairs}}{\#\,\text{frames (shots) in } B} \times 100\%, \qquad (21)
\]
\[
\text{F-score} = \frac{P \cdot R}{0.5\,(P + R)} \times 100\%. \qquad (22)
\]

For datasets with multiple human-created summaries, we average or take the maximum over the human-created summaries to obtain the evaluation metrics for each video. We then average over the videos to obtain the metrics for the datasets.
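As a concrete reference, the computation in Eqs. (20)-(22) reduces to a few lines; the sketch below assumes the number of matched pairs has already been obtained (e.g., from the VSUMM matching described next), and the function name is ours.

def prf(n_matched, n_in_A, n_in_B):
    """Precision, recall, and F-score, reported in %, following Eqs. (20)-(22)."""
    p = n_matched / n_in_A                               # fraction of A's frames/shots matched
    r = n_matched / n_in_B                               # fraction of B's frames/shots matched
    f = 0.0 if p + r == 0 else p * r / (0.5 * (p + r))   # harmonic mean of p and r
    return p * 100.0, r * 100.0, f * 100.0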

C.1. Kodak, OVP and YouTube datasets

We utilize the VSUMM package [4] to evaluate video summarization results. Given two sets of summaries, one produced by an algorithm and the other by human annotators, this package outputs the maximum number of matched pairs of frames between them. Two frames qualify as a matched pair if their visual difference4 is below a certain threshold, while each frame of one summary can be matched to at most one frame of the other summary. We can then measure the F-score based on these matched pairs.
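The released VSUMM package implements the actual matching; purely to illustrate the one-to-one constraint, the following is a simplified greedy approximation under assumed inputs (per-frame color histograms for both summaries and a hypothetical distance threshold), not the package's exact algorithm.

import numpy as np

def greedy_match(hist_A, hist_B, threshold=0.5):
    """Greedily count one-to-one matches between two summaries.

    hist_A, hist_B : arrays of shape (n_A, d) and (n_B, d) with per-frame color histograms
    threshold      : maximum visual difference allowed for a matched pair (illustrative value)
    """
    used_B = set()
    matched = 0
    for a in hist_A:
        # Distance from this frame of A to every frame of B (e.g., L1 on histograms).
        dists = np.abs(hist_B - a).sum(axis=1)
        for b in np.argsort(dists):
            if b not in used_B and dists[b] <= threshold:
                used_B.add(b)          # each frame of B is matched at most once
                matched += 1
                break
    return matched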

C.2. SumMe

We follow the protocol in [11, 12]. As they constrain the summary length to ≤ 15% of the video duration, we take the shots selected by our model and, if their total duration exceeds 15%, we further rank them by their values on the diagonal of the L-kernel matrix (higher values imply more important items in our probabilistic framework for subset selection). We then take the highest-ranked shots, stopping once adding another shot would exceed 15% of the video duration.
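A minimal sketch of this length-constrained post-processing, assuming we are given the selected shot indices, per-shot durations, and the learned L-kernel (all variable and function names are ours):

import numpy as np

def truncate_to_budget(selected, durations, L, video_length, budget=0.15):
    """Keep the highest-ranked selected shots within 15% of the video duration.

    selected     : indices of the shots chosen by the model
    durations    : duration (seconds) of every shot in the video
    L            : learned L-kernel over the shots; diag(L) reflects per-shot importance
    video_length : total video duration in seconds
    """
    # Rank the selected shots by their diagonal entries (higher = more important).
    ranked = sorted(selected, key=lambda i: L[i, i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        if total + durations[i] > budget * video_length:
            break
        kept.append(i)
        total += durations[i]
    return sorted(kept)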

In computing the F-score, [11, 12] use the same formulation when comparing a summarization result to a user summary. However, when dealing with multiple user summaries for one video, instead of taking the average as in [11], [12] reports the highest F-score over all the users. In this paper, we follow [12].

C.3. MED

For MED, we conduct subshot-based summarization. We follow the same procedure as before to align each user's annotation with the pre-determined shot boundaries in the dataset (cf. section B.3 of this supplementary material) to create user summaries. We then evaluate similarly to how we do on the Kodak, OVP, and YouTube datasets, treating subshots as “frames”. Since the KVS algorithm [39] cannot decide the summarization length but only scores each shot, we simply take the top 15% or 30% of shots (ordered by score) as the summary of KVS.

4 VSUMM computes the visual difference automatically based on color histograms.


Table 6: For our method, better frame-based visual similarities improve summarization quality. We report 100-round results for Kodak and OVP, and 5-round results for the remaining datasets.

              sim1        sim2        sim3
Kodak         76.6±0.6    80.2±0.3    82.3±0.3
OVP           71.4±0.3    75.3±0.3    76.5±0.4
YouTube soft  58.9±1.7    60.6±1.4    60.9±1.3
YouTube hard  59.6±1.4    61.5±1.5    61.5±1.5

Table 7: Results on the YouTube dataset (39 videos), all based on 100 rounds.

          VSUMM1 [4]   seqDPP [10]   Ours
YouTube   56.9±0.5     60.3±0.5      60.2±0.7

D. Detailed experimental results and analysis

D.1. Benefits of learning visual similarity

We further examine the effect of employing different measures of visual similarity. As shown in Table 6, nonlinear similarity with the Gaussian-RBF kernel (sim2) generally outperforms linear similarity (sim1), while learning a non-identity metric Ω (sim3) improves the performance only marginally. In the following experiments, we use sim2.

D.2. Results on the complete YouTube dataset

The original YouTube dataset [4], after excluding the cartoon videos as in [10], contains 39 videos, 8 of which belong to neither the Sports nor the News category. In the main text, we used 31 videos, mainly for the purpose of comparing to summarization results that exploit the category prior.

For maximum comparability to prior work on this dataset, we also report our summarization results on all 39 videos and compare to previously published results in Table 7.

D.3. Detailed results with category prior

The results for the YouTube, SumMe, and MED datasets are shown in Tables 8, 9, and 10, respectively.

We can see that both the soft and the hard category-specific settings outperform the setting without category information in most cases. For example, a 'birthday party' video is likely to share summary structures with an 'outdoor activity' video, so the two help summarize each other.

Table 9: Category-specific experimental results on the SumMe dataset. SP I and SP II stand for Super-category I and II, respectively. We report 5-round results.

Setting       Testing   Category determined by a classifier
SumMe         SP I      38.6±0.5
              SP II     40.7±0.7    (combined: 39.2±0.7)
SumMe soft    SP I      39.2±0.6
              SP II     41.9±0.7    (combined: 40.2±0.7)
SumMe hard    SP I      39.8±0.8
              SP II     43.3±0.9    (combined: 40.9±0.9)

Table 11: Comparison of shallow and deep features for summarization. We report 5-round results.

              Shallow     Deep
Kodak         80.2±0.3    79.1±0.4
YouTube       60.0±1.3    59.3±1.5
YouTube soft  60.6±1.4    59.6±1.6
YouTube hard  61.5±1.5    61.0±1.6

Additionally, even when the ground-truth categories of the testing videos are unknown and need to be determined with a category classifier, the category prior can still help.

D.4. Comparison between deep and shallow features

Here we contrast deep and shallow features to show that deep features trained for visual recognition do not help much over shallow features for summarization. As shown in Table 11, we compare Fisher vectors and color histograms with state-of-the-art CNN features from GoogLeNet [43].

D.5. Comparison to seqDPP

Our method outperforms seqDPP [10] on Kodak and YouTube (cf. Table 2 in the main text), whose contents are more diverse and more challenging than OVP's. We suspect that, since videos in OVP are edited and thus less redundant, methods with higher precision such as seqDPP may perform better there than methods with higher recall (such as the proposed approach).

At testing time, our model requires more computation. On YouTube, it takes 1 second per video on average, slower than the 0.5 second of seqDPP. However, our model is far more advantageous in training. First, it learns far fewer parameters (cf. the main text: about 9,000 for α in eq. (5) and the diagonal Ω in eq. (4)), while seqDPP requires tuning 80,000 parameters. Our model thus learns well even on small training datasets, as shown


Table 8: Exploiting the category prior on the YouTube dataset, which contains 31 videos in total in the Sports and News categories. We report the F-score in each category as well as on the whole dataset, averaging over 5 rounds of experiments.

Setting        Testing video's category   Category prior not used
YouTube        Sports                     53.5±1.5
               News                       66.9±1.2
               Combined                   60.0±1.3

Setting        Testing video's category   Ground-truth category is used   Category determined by a classifier
YouTube soft   Sports                     53.4±1.5                        53.4±1.5
               News                       68.2±1.4                        67.5±1.6
               Combined                   60.6±1.4                        60.2±1.6
YouTube hard   Sports                     54.4±1.6                        54.4±1.6
               News                       69.1±1.4                        68.3±1.8
               Combined                   61.5±1.5                        61.1±1.8

Table 10: Category-specific experimental results on the MED summaries dataset with 10 categories. We consider two settings, K = 15 and K = 30, as mentioned in sections B.3 and C.3. We report 5-round results.

(a) User summaries at K = 15

Setting     Category prior not used   Ground-truth category is used   Category determined by a classifier
MED         28.9±0.8                  --                              --
MED soft    --                        30.7±1.0                        29.4±1.2
MED hard    --                        30.4±1.0                        28.5±1.3

(b) User summaries at K = 30

Setting     Category prior not used   Ground-truth category is used   Category determined by a classifier
MED         47.2±0.7                  --                              --
MED soft    --                        48.6±0.8                        45.9±1.0
MED hard    --                        48.1±1.1                        44.5±1.1

on Kodak (cf. section 4.3 of the main text). Second, learning seqDPP is computationally very intensive; according to its authors, a great deal of hyperparameter tuning and model selection was performed. Learning our model is noticeably faster. For example, on YouTube, our model takes about 1 minute per configuration of hyperparameters, while seqDPP takes 9 minutes.

D.6. Qualitative comparisons

We provide exemplar video summaries in Fig. 3 (and the attached movies), along with the human-created summaries. Since OVP and YouTube have five human-created summaries per video, we display the merged oracle summary (see section B.1) for brevity.

In general, our approach provides summaries most similar to those created by humans. We attribute this to two factors. First, employing supervised learning helps identify representative content. VSUMM1 [4], though achieving diversity via unsupervised clustering, fails to capture such a notion of representativeness, resulting in a poor F-score and a drastically different number of summarized frames compared to the oracle/human annotations. Second, by non-parametrically transferring the underlying criteria of summarization, the proposed approach arrives at better summarization kernel matrices L than seqDPP, further eliminating several uninformative frames from the summaries (thus improving precision). The failure case at the bottom, however, suggests how to improve the method further, namely by increasing the recall rate. We believe those missing frames (and thus the shorter summaries) could be recovered if we jointly considered the kernel computed explicitly on the test video (as in seqDPP) and the kernel computed by our method.


[Figure 3 compares, for three videos, the oracle summary with the summaries produced by each method: OVP Video 60 (VSUMM1 F = 69, seqDPP F = 74, Ours F = 84), YouTube Video 76 (VSUMM1 F = 54, seqDPP F = 57, Ours F = 74), and OVP Video 25 (VSUMM1 F = 87, seqDPP F = 88, Ours F = 73).]

Figure 3: Exemplar video summaries generated by VSUMM1 [4], seqDPP [10], and ours (sim3), along with the (merged) human-created summaries. We index videos following [4]. On two videos, our approach reaches the highest agreement with the oracle/human-created summaries compared to the other methods. The failure case at the bottom, however, hints at the limitation. See the text for details.

E. Detailed discussions

E.1. Computational complexity and practicality

Section 3.5 of the main text discusses the computational cost and ways of reducing it. Specifically, the cost depends on (1) the length of the training and testing videos and (2) the number of training videos.

For (1), we can down-sample the videos to reduce the number of frames (cf. section 3.5 of the main text and section B.1 for details) and exploit subshot-based transfer (section 3.4 of the main text). For the latter, the size of L (eq. (11) of the main text) depends only on the number of subshots (not frames), and the subshot-to-subshot similarity further reduces the computational cost without worsening the performance (cf. Table 5 of the main text). Moreover, the sequential modeling trick of [10] can also be used to reduce the cost (see section A.4).

For (2), the proposed hard category-specific transfer (cf. section 3.4 of the main text) enables transferring from fewer training videos with improved performance.

E.2. Applicability to long egocentric videos

Egocentric video is challenging due to its length and diverse content (e.g., multiple events). One possible strategy is to apply existing techniques to detect and segment the video into shorter events and then run our approach on the shorter segments, or to use the methods described in [25] to zoom into frames surrounding the occurrence of important people and objects.

E.3. Mechanisms to check for failure

It is interesting to investigate when our approach will fail in practice. As our approach is non-parametric, transferring summarization structures from annotated videos, the visual similarity between the testing and annotated videos might be a useful cue. We conduct an analysis (cf. Fig. 4) and show that there is indeed a positive correlation between visual similarity and summarization quality. A preliminary fail-safe mechanism

Page 18: arXiv:1603.03369v3 [cs.CV] 29 Apr 2016 · based skims [24,31,36,37], story-boards [6,9], mon-tages [16,42] or video synopses [40]. In this paper, we focus on developing learning algorithms

would thus be to threshold on this similarity.
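A minimal sketch of such a fail-safe check, assuming a video-level feature vector per video and using cosine similarity; the threshold value is purely illustrative and would need to be validated on held-out data.

import numpy as np

def likely_to_fail(test_feature, training_features, threshold=0.3):
    """Flag a test video whose best similarity to the annotated videos is too low.

    test_feature      : video-level feature vector of the test video
    training_features : array of shape (n_train, d) with video-level features of annotated videos
    threshold         : minimum acceptable similarity (illustrative value)
    """
    # Cosine similarity between the test video and every annotated training video.
    sims = training_features @ test_feature / (
        np.linalg.norm(training_features, axis=1) * np.linalg.norm(test_feature) + 1e-12)
    return sims.max() < threshold    # True = summary quality is likely to be poor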

[Figure 4 contains two scatter plots, one for Youtube_Sports and one for Youtube_News, with similarity rank on the x-axis and F-score on the y-axis.]

Figure 4: Visual similarity (x-axis; smaller ranks imply stronger similarity) between the testing and annotated videos is positively correlated with summarization quality (F-score on the y-axis). Results are from summarizing YouTube via hard transfer.