
Video Summarization with Semantic Concept Preservation

Zheng Yuan, University of Florida, Gainesville, USA (zheng.yuan@ufl.edu)
Taoran Lu, Dolby Technologies, Los Angeles, USA ([email protected])
Dapeng Wu, University of Florida, Gainesville, USA ([email protected]fl.edu)
Yu Huang, Huawei Technologies, Bridgewater, USA ([email protected])
Heather Yu, Huawei Technologies, Bridgewater, USA ([email protected])

ABSTRACT

A compelling video summarization should allow viewers to understand the summary content and recover the original plot correctly. To this end, we materialize the abstract elements that are cognitively informative for viewers as concepts. Concepts implicitly convey the semantic structure and are instantiated by semantically redundant instances. We then argue that a good summary should i) keep the various concepts complete and balanced, so as to give viewers comparable cognitive clues from a complete perspective, and ii) maximize saliency, so that the rendered summary is attractive to human perception. We formulate video summarization as an integer programming problem and give a ranking-based solution. We also propose a novel method to discover the latent concepts by spectral clustering of bag-of-words features. Experimental results on human evaluation scores demonstrate that our summarization approach performs well in terms of informativeness, enjoyability and scalability.

Keywords

Video summarization, integer programming, attention model

1. INTRODUCTION

The fundamental purpose of video summarization is to epitomize a long video into a succinct synopsis that allows viewers to quickly grasp the general idea of the original video. The resultant summary provides a compact representation of the original content structure. Although brief, a good summary preserves all the necessary hallmarks of the original video, so that viewers can recover the original content through reasoning and imagination. Based on the cognitive level (signal or semantic) at which a metric function lies, we categorize current summarization techniques into three types.


Type I utilizes signal-level features to measure the difference between a video summary and its original. Implementations include the motion trajectory curve [2] and summarized PSNR [4]. All these metrics are manipulations of pure pixel intensities and in essence measure the visual diversity contained in a summary. Maximizing them therefore yields the summary with the most content diversity, but not necessarily the one that presents the clues most important to viewers' understanding.

Type II is characterized by high-level semantic analysis, in which semantic events with explicit meanings are recognized. Generally, the semantics are defined explicitly by some offline database that annotates its entries with meaningful tags. Through supervised learning from labeled data, methods in this category detect events with explicit meanings; typical implementations include the "who and what inquiries" [6] and the "who, what, where and when entities" [1]. However, due to the limitations of current computer intelligence, recognizing an entity with explicit meanings remains difficult owing to the well-known "semantic gap" problem.

Type III lies at the intermediate level, seeking entities with implicit meanings. The researchers in [8, 9] assume the semantics are implicitly expressed by popular human perception models, and they yield summaries containing the most salient video units. Unfortunately, salient features are not necessarily semantically distinguishing: they essentially measure how interesting a video segment is, and an interesting part may or may not be an important clue for understanding.

In this paper, we propose an intermediate-level approach to explore the implicit semantics of the original video. We pursue a self-explanatory video summary by discovering and preserving "concepts". Concepts symbolize the abstract clues on which viewers rely to recover the original plot thread; although at the intermediate level, they are patterns learned from the original video that carry implicit semantic meanings, rather than explicit, rigorously defined meanings with less generalization. The motivation for concepts is intuitive: emulating the human cognitive process, a short summary naturally needs a list of key patterned hints, such as characters, settings, actions and their order, so that viewers can stitch these hints together logically and use imagination to fill in the omitted parts. The concepts encode the semantic meanings of these patterned hints. Specifically, we extract visual features and use spectral clustering to discover the "concepts", and we treat the repetition of shot segments that instantiate the same concept as summarization redundancy. We further analyze the criteria of a good summary and formulate summarization as an integer programming problem, for which a ranking-based solution is given.

The main contributions of this paper are therefore:


Figure 1: The semantics in "The Big Bang Theory".

(i) An in-depth analysis of the philosophy of video summarization that answers the key questions: what are the semantically important elements and redundancies, and how can the semantics be kept in the summary?

(ii) A clustering-based method that discovers the semantic structure of the video as concepts and instances.

(iii) A ranking-based method for summary generation, a scalable framework that adapts detail preservation to the summary length.

2. THE SUMMARIZATION PHILOSOPHY

In order to capture the video semantics, the first question is how the semantics are expressed in the video. Observing many meaningful videos, we find that the major semantics embedded in them form a plot outline the videos try to express. Generally, the outline can be abstracted as a sequence of "entity combinations" (e.g., in Fig. 1, the four persons, as four entities, are combined to have a conversation). Each "entity combination" forms a particular concept that exhibits certain semantic implications. Here, a concept differs from the "cast list" [11] in that the constituent entities are not necessarily "faces" or "actors" but may include any entity distinguishing enough to characterize the concept. Also, the combination is not a simple list of entities but emphasizes their interaction: through the postures and actions of the entities, the semantics are vividly implied within the concept.

These concepts shape the course of the plot and are thus the skeleton of the video. Furthermore, each concept is materialized by multiple video shots, which we call the instances of a concept. Each shot instance, as the muscle of the video, elaborates the concept from a specific perspective (e.g., in Fig. 1, the concept of having a conversation is expressed by shots 1 to 17; some instances depict one person while others depict another, and together they convey that four persons are having a conversation). However, since the instances portray the same concept, they are destined to include redundancies in the summarization sense.

Therefore, the concepts can be seen as a parsimonious abstraction of the video semantics, which suits our summarization purpose very well: by eliminating unnecessary instances with semantic redundancies while keeping the concepts intact in the summary, we shorten the long video without compromising the semantics. The preservation of a concept can be fulfilled by keeping, with higher priority, the instances that express the concept with more distinguishing characteristics. From the viewer's perspective, once informed of the semantic skeleton, viewers can refill the full body of the video outline through reasoning: they witness the concepts and the entities within them and concatenate them into a plot thread based on experience or even imagination, thus understanding the video.

Figure 2: Left: two semantically similar frames. Top right: words and codewords. Bottom: a BoW vector.

For example, in Fig. 1, when all the concepts in the middle row are available in the summary, viewers are able to realize that the four people are having a conversation. Video summarization therefore consists of two phases: exploring the inherent concept-instance structure, followed by preserving the prioritized instances in the summary.

3. RECOGNIZE THE CONCEPTS

In our context, discovering the concepts is an unsupervised learning problem. We first divide the full video into multiple shots based on signal-level change detection (a stand-in detector is sketched below) and regard each shot as a sample. Then we use an unsupervised learning method to cluster the samples into several classes; each class of samples collectively expresses a concept. For feature extraction, defining a concept as an "entity combination" implies that samples within the same concept/class share a similar interaction of entities. Thus, a feature vector capturing the entity constituents of a shot is required, preferably robust to variations of the entities in posture, scale and location. For that purpose, we propose the bag-of-words (BoW) model: a "bag" is a shot instance regarded as a container, and the "words" are the visual entities. The BoW model provides a macro view of the semantic ingredients within an instance and emphasizes the conceptual similarity among instances, considering entity attendance only. In Fig. 2, a far view of a rose and a butterfly and a close-up of the same entities should both carry the same semantic implication, even though the rose and butterfly appear at different locations and scales. Evidently, the BoW feature gracefully captures this order-irrelevant property and assigns the two frames similar values.
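The paper does not specify its signal-level change detector. As a purely illustrative stand-in, the sketch below segments shots by thresholding color-histogram distances between consecutive frames; the detector, its threshold, and the function name are our assumptions, not the authors' method.

```python
# Illustrative shot segmentation via color-histogram differences.
# The paper only says "signal level change detection"; this particular
# detector and its threshold are assumptions for illustration.
import cv2

def detect_shots(frames, threshold=0.5):
    """Split a list of BGR frames into shots; returns (start, end) pairs."""
    boundaries = [0]
    prev_hist = None
    for t, frame in enumerate(frames):
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [8, 8, 8], [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        # A large distance between consecutive histograms marks a cut.
        if prev_hist is not None and cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            boundaries.append(t)
        prev_hist = hist
    return list(zip(boundaries, boundaries[1:] + [len(frames)]))
```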

The next question is how to construct the BoW feature for each shot. Ideally, a "word" would be a perfectly extracted entity. However, due to the limitations of current object detection techniques, we instead use the SIFT feature points [5] within detected salient objects [7] to represent an entity. Parsing the entire original video, all occurring "words" constitute a "dictionary", which describes the space of all entities occurring in the video. In order to distinguish the significant entities, we compress the "words" into "codewords", which convey the characteristic entities after dimension reduction. For each sample/shot, we count the number of occurrences of each "codeword" to form its final BoW feature vector. Fig. 2 shows the original "words" for a frame, the compressed "codewords", and the resulting bag-of-words feature. Finally, we exploit spectral clustering [10] to cluster the shots into several classes, each of which corresponds to a concept.
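The following is a minimal sketch of this pipeline: SIFT "words" restricted to salient regions, quantization into "codewords", per-shot BoW histograms, and spectral clustering of shots. The k-means codebook, its size, the affinity choice, and the helper names are our assumptions; the paper specifies only the overall pipeline.

```python
# Minimal sketch of the BoW + spectral-clustering concept discovery,
# assuming shots are already segmented and per-frame salient-object
# masks (uint8, per [7]) are given. Helper names are hypothetical.
import cv2
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def shot_descriptors(frames, masks):
    """Collect SIFT descriptors ("words") inside salient regions of a shot."""
    sift = cv2.SIFT_create()
    descs = []
    for frame, mask in zip(frames, masks):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, d = sift.detectAndCompute(gray, mask)  # mask limits to salient objects
        if d is not None:
            descs.append(d)
    return np.vstack(descs) if descs else np.empty((0, 128))

def discover_concepts(shots, masks, n_codewords=200, n_concepts=10):
    """shots: list of frame lists. Returns one concept label per shot."""
    per_shot = [shot_descriptors(f, m) for f, m in zip(shots, masks)]
    # 1) Quantize all "words" into "codewords" (the dictionary).
    codebook = KMeans(n_clusters=n_codewords).fit(np.vstack(per_shot))
    # 2) One BoW histogram per shot: occurrence counts of each codeword.
    bow = np.zeros((len(shots), n_codewords))
    for i, d in enumerate(per_shot):
        if len(d):
            idx, cnt = np.unique(codebook.predict(d), return_counts=True)
            bow[i, idx] = cnt
    # 3) Spectral clustering of shots; each cluster is one concept.
    return SpectralClustering(n_clusters=n_concepts,
                              affinity="nearest_neighbors").fit_predict(bow)
```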

4. THE SUMMARIZATION METHODOLOGY


4.1 Criteria of a Good Video Summary

The most significant criterion of a good summary is to preserve all original concepts. If any concept is absent from the summary, viewers may be misled by fractional clues and imagine a totally deviated plot.

Also, concepts should be rendered with balance, i.e., the summary should contain equal or comparable numbers of instances for each concept. Since every concept in the summary is semantically compact and decisive, ignoring or overemphasizing any of them through imbalanced rendering may lead viewers to recover a plot that deviates from the original.

The third criterion is that each concept is preferably rendered by its most interesting instances. Then, for a given length, the summary not only preserves the semantics but also best triggers viewers' interest in recovering the plot.

4.2 Constrained Integer Programming

Given a clear semantic structure, we model video summarization as an integer programming problem. Let c_ij be the binary indicator that denotes whether we retain instance s_ij, the jth instance of concept i, in the summary. We want to find a combination of instances, indicated by c = {c_ij}, that delivers all concepts with the most interestingness or, equivalently, maximizes the total attention saliency for each concept:

$$ c_i = \arg\max_{c_i} \sum_{j=1}^{n_i} c_{ij}\,\xi(s_{ij}), \qquad \forall i = 1,\dots,N \tag{1} $$

subject to

$$ \sum_{i=1}^{N} \sum_{j=1}^{n_i} c_{ij} \le r \tag{2} $$

$$ \min\left(\left\lfloor r/N \right\rfloor,\, n_i\right) \;\le\; k_i = \sum_{j=1}^{n_i} c_{ij} \;\le\; \min\left(\left\lceil r/N \right\rceil,\, n_i\right), \qquad \forall i = 1,\dots,N \tag{3} $$

$$ c \in \left\{ c_{ij} \;\middle|\; \arg\max_{c_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{n_i} c_{ij} I_i \right\} \tag{4} $$

where ξ(s_ij) is the saliency of s_ij, n_i is the number of instances of the ith concept in the original video, and k_i is the number of instances of concept i in the summary. N is the total number of concepts in the original video, and r is the maximum number of instances we can retain in the summarized video. I_i is the importance rank of concept i, which depends on the number of frames devoted to that concept pattern, based on the common-sense observation that a director naturally allocates more screen time to a more important concept.

Constraint I in Eq. 2 comes from the summary length limit. In our approach, we cap the number of frames in a detected shot at a constant, so r is almost proportional to the predefined summarization ratio. Constraint II in Eq. 3 is imposed by the concept completeness and balance criteria: given a summarization ratio, we always try to deliver all concepts using commensurable numbers of shot instances (for example, with N = 10 concepts and a budget of r = 25 instances, each concept receives either ⌊25/10⌋ = 2 or ⌈25/10⌉ = 3 instances). Constraint III in Eq. 4 handles the critical situations where r is so small that the summary cannot keep even one shot instance per concept, or where the summary length does not allow all concepts to be delivered with exactly equal numbers of instances; here we give priority to keeping the more important concepts in the summary.

Given Constraint II in Eq. 3, the numbers of instances for the different concepts should be almost the same, when achievable, no matter how long the summary is. Therefore, we construct the summary in a bottom-up fashion: starting from an empty summary, we admit one instance from one concept at a time and continue until the resultant summary length reaches its limit.

Hence, no matter when the process terminates, we guarantee that the numbers of instances for different concepts differ by at most one (except when all instances are recruited into a long summary).

The scalability of the summary is straightforward: if the summary length is short, we keep a modest number of the most distinguishing instances for every concept; if it is longer, we add more instances for every concept, rendering a more detailed summary. This ability to adapt the preserved detail to a variable summary length suits the needs of ubiquitous multimedia terminals that accept different levels of summaries.

In the algorithm, the key question is which instance to include in the summary. An instance is indexed by the rank of the concept it belongs to and by its rank within that concept. Following the aforementioned assumption, we rank concepts according to their lengths in the original video. Within a concept, we rank the instances according to their saliency values; thus the admitted instances of a concept accumulate the highest interestingness for the given summary length, fulfilling the objective in Eq. 1. Here, based on the human attention model in [7], we define the visual saliency ξ_v(s_i) of a shot s_i as the average of the frame saliency π(t) over all frames t within the shot:

$$ \xi_v(s_i) = \frac{1}{|s_i|} \sum_{t=1}^{|s_i|} \pi(t) \tag{5} $$
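To make the construction concrete, here is a minimal sketch of the ranking-based solution just described: concepts ranked by total frame count (the importance I_i), instances ranked within each concept by the shot saliency of Eq. (5), and round-robin admission until the budget r is exhausted. The data layout and names are ours, chosen for illustration.

```python
# Sketch of the ranking-based summary construction. `instances[i]` is a
# list of (shot_id, saliency) pairs for concept i, with saliency computed
# per Eq. (5); `frame_counts[i]` is concept i's total frame count (I_i).
# This layout is an assumption made for illustration.
def rank_based_summary(instances, frame_counts, r):
    # Rank concepts by importance: more screen time, higher priority.
    concept_order = sorted(instances, key=lambda i: frame_counts[i],
                           reverse=True)
    # Within each concept, rank instances by saliency (Eq. 1 objective).
    queues = {i: sorted(shots, key=lambda s: s[1], reverse=True)
              for i, shots in instances.items()}
    summary = []
    # Round-robin admission: one instance per concept per pass keeps the
    # per-concept counts k_i within one of each other (Constraint II);
    # if r runs out mid-pass, more important concepts win (Constraint III).
    while len(summary) < r and any(queues.values()):
        for i in concept_order:
            if queues[i] and len(summary) < r:
                summary.append(queues[i].pop(0))
    return summary
```

A useful property of this admission order is that the summary for a smaller r is always a prefix of the one for a larger r, which is exactly the subset behavior reported in Section 5.1.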

5. RESULTS

5.1 An Example to Illustrate Our Algorithm

In this section, we evaluate the performance of the proposed concept recognition method and the overall video summarization approach, using two summarization ratios, 50% and 20%. We implement the proposed approach in C. The evaluation sequence is the 10-minute Big Buck Bunny, which comes from the Open Movie Project and carries no licensing issues. Fig. 3 shows the recognized concepts. Since the original video is too long to visualize, we illustrate frames sampled every 15 seconds from the original video. In Fig. 3, each concept consists of the frames with the same border color, and frames that express similar semantics are indeed clustered into the same concept. For example, the frames in the red box all depict the bunny playing with the butterfly, while the frames in the green box all show the bunny standing by himself. Three frames are shown with no border color: they are clustering outliers, which is explained by their negligible occurrence in the video; they are too trivial to express a concept. For the number of concepts (the complexity of the clustering model), we enumerate 5, 10, 15, 20 and 25 as candidates, measure the clustering error for each, and use Occam's razor to settle on 10 (a sketch of this selection follows below). The clustering result indicates that our concept recognition method works well.
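The paper does not give its clustering-error measure or the exact razor rule. The sketch below shows one plausible realization under our assumptions: a centroid-distance error and a diminishing-returns stopping rule over the candidate concept counts.

```python
# Sketch of model selection for the number of concepts. The error
# measure (distance to cluster centroids) and the stopping tolerance
# are assumptions; the paper does not specify them.
import numpy as np
from sklearn.cluster import SpectralClustering

def clustering_error(bow, labels):
    """Mean distance of each shot's BoW vector to its concept centroid."""
    err = 0.0
    for c in np.unique(labels):
        members = bow[labels == c]
        err += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return err / len(bow)

def pick_n_concepts(bow, candidates=(5, 10, 15, 20, 25), tol=0.05):
    errors = []
    for k in candidates:
        labels = SpectralClustering(n_clusters=k,
                                    affinity="nearest_neighbors").fit_predict(bow)
        errors.append(clustering_error(bow, labels))
    # Occam's razor: return the first candidate whose successor improves
    # the error by less than tol (relative); prefer the simpler model.
    for i in range(len(candidates) - 1):
        if errors[i] - errors[i + 1] < tol * errors[i]:
            return candidates[i]
    return candidates[-1]
```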

Fig. 4 shows the summary results at the 50% and 20% ratios. Note that the 20% summary is a subset of the 50% summary. This suggests that the summary can be packaged into multiple layers, each layer being the difference between the summaries at two adjacent length scales, as sketched below. This suits the scalability needs of multimedia systems: a terminal can decide, based on its own needs or situation, how many layers to receive, without any change to the summarization process.
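Concretely, the layered packaging implied by this subset property could be realized as follows, assuming `admission_order` is the full instance ordering produced by the round-robin construction above (a name of ours, not the paper's):

```python
# Sketch of layered packaging: shorter summaries are prefixes of the
# greedy admission order, so each layer is the difference between two
# adjacent length scales.
def package_layers(admission_order, lengths):
    """lengths: ascending instance budgets, e.g. for the 20% and 50%
    ratios. Returns one shot list per layer; a terminal subscribing to
    the first k layers reconstructs exactly the k-th summary."""
    layers, prev = [], 0
    for n in sorted(lengths):
        layers.append(admission_order[prev:n])
        prev = n
    return layers
```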


Figure 3: Recognized concepts.

Figure 4: The 50% and 20% summarization results

Also, both summaries contain complete concepts and thus preserve the semantics effectively. It is natural for the 50% summary to include more details, since its length allows.

5.2 Subjective Evaluation

We also carry out a subjective test of human responses to the generated summaries. We adopt the two metrics proposed in [8], informativeness and enjoyability, to quantify summary quality under different summarization ratios. Enjoyability reflects users' satisfaction with the viewing experience; informativeness assesses the capability of maintaining content coverage while reducing redundancy.

The subjective test is set up as follows. First, to avoid tiring the participants, we carefully picked four test videos: a 7-minute movie clip from "Lord of the Rings" (LoR) in the MUSCLE movie database, a 6-minute TV clip from "The Big Bang Theory" (BBT), a 10-minute animation clip from "Big Buck Bunny" (BBB) and a 6-minute news clip from "CNN Student News" (CNN). Then we chose summarization ratios of 50%, 30%, 20% and 10% for LoR, BBT, BBB and CNN respectively, covering both ordinary and extreme summarization cases. We used both our approach and the scalable approach in [3] to generate the summaries.

During the test, each participant was asked to give each summary enjoyability and informativeness scores ranging from 0% to 100%. We collected the scores submitted by 60 participants and calculated their means and variances for the two approaches, as shown in Table 1. The scores for our approach show that removing 50% to 90% of the video content reduces enjoyability and informativeness by less than 50%, which is encouraging. Moreover, compared with the scalable approach, our scores are slightly better in all informativeness cases and most enjoyability cases, suggesting that our approach preserves more semantic information.

                    Our Approach       Scalable Approach [3]
                    Mean     Var.      Mean     Var.
  LoR  50%   E      64.05    18.21     62.88    19.82
             I      66.08    18.71     65.50    18.67
  BBT  30%   E      63.28    20.31     61.07    21.79
             I      69.82    18.56     66.49    19.60
  BBB  20%   E      61.25    21.86     62.30    20.32
             I      68.00    16.90     67.50    17.26
  CNN  10%   E      55.06    19.59     53.92    20.19
             I      60.87    18.58     59.49    17.15

Table 1: The statistics of scores for the four video clips (E: enjoyability, I: informativeness).

6. CONCLUSIONS

In this paper, we aimed at a summary that guarantees viewers' understanding of the video plot. We proposed a novel method to represent the semantic structure as concepts and instances, and argued that a good summary keeps all concepts complete, balanced and saliently illustrated. The experiments showed the effectiveness of the proposed concept detection method, and the human subjective test confirms the strength of the summarization method. In future work, we will refine the BoW features by conducting psychological research on how candidate visual entities relate to viewers' understanding, and improve the "words" definition accordingly.

7. REFERENCES

[1] B. Chen, J. Wang, et al. A novel video summarization based on mining the story-structure and semantic relations among concept entities. IEEE Trans. Multimedia, 11(2):295–312, 2009.

[2] D. DeMenthon and D. Doermann. Video summarization by curve simplification. In ACM Int'l Conf. Multimedia, pages 211–218, 1998.

[3] L. Herranz and J. Martinez. A framework for scalable summarization of video. IEEE Trans. Circuits Syst. Video Technol., 20(9):1265–1270, 2010.

[4] Z. Li, G. M. Schuster, et al. MINMAX optimal video summarization. IEEE Trans. Circuits Syst. Video Technol., 15:1245–1256, 2005.

[5] D. Lowe. Object recognition from local scale-invariant features. In IEEE Int'l Conf. Computer Vision, pages 1150–1157, 1999.

[6] S. Lu et al. Semantic video summarization using mutual reinforcement principle. In Multimedia Modelling Conf., pages 60–67, 2005.

[7] T. Lu, Z. Yuan, et al. Video retargeting with nonlinear spatial-temporal saliency fusion. In IEEE Int'l Conf. Image Processing, pages 1801–1804, 2010.

[8] Y. Ma, X. Hua, et al. A generic framework of user attention model and its application in video summarization. IEEE Trans. Multimedia, 7(5):907–919, 2005.

[9] C. Ngo, Y. Ma, et al. Video summarization and scene detection by graph modeling. IEEE Trans. Circuits Syst. Video Technol., 15(2):296–305, 2005.

[10] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[11] T. Wang, Y. Gao, et al. Video summarization by redundancy removing and content ranking. In ACM Int'l Conf. Multimedia, pages 577–580, 2007.
