THE DYNAMIC VIDEOBOOK: A HIERARCHICAL SUMMARIZATION FOR SURVEILLANCE VIDEO
Lei SUN, Haizhou AI
Computer Science and Technology Department, Tsinghua University
Beijing, 100084, [email protected]
Shihong LAO
Development Center, OMRON Social Solutions Co., LTD
Kyoto 619-0283, [email protected]
ABSTRACT

This paper proposes a novel hierarchical summarization system for surveillance video in which the synopsis video and the original video form the lower layer and a dynamic collage forms the higher layer. A user can efficiently browse the video content by dynamically zooming into collage images and then skimming the synopsis video or the original video for details, enabling an interactive interface for video navigation. We call this system the Dynamic VideoBook because its structure is akin to a book: the dynamic collage images resemble the cover and the table of contents, while the synopsis video gives an abridged edition of the video. Unlike previous collage techniques for films or entertainment videos, collage for surveillance video is challenging, since such video imposes restrictions on spatiotemporal relationships and must focus on multiple foreground objects. We therefore put forward a new collage method that better conveys spatiotemporal motion while structurally extracting the visual contents from the video. We also enable dynamic zooming-in to reveal more information in real time. In addition, by mapping the upper layer of sketchy visualization to the lower layer of visual data, the system can efficiently explore and browse the video content, highlight one motion, or interactively search for a specific fragment.
Index Terms— Video summarization, surveillance video, dynamic video collage
1. INTRODUCTION
Video summaries provide compact, coherent and succinct representations of a video's content, such as video collage [1], video synopsis [2], video narratives [3], tapestry [4] and storylines [5]. Since the explosive growth of surveillance video data presents formidable challenges to browsing, retrieval and storage, video summarization is significant in relieving humans from laborious manual browsing and examination. Of existing work, video synopsis [2] has proved to be an effective method for summarizing video in a compact and coherent manner. Unlike video synopsis, video collage [1] and related work [3–5], commonly applied to movies or entertainment videos with one or several characters and a varying background, are not well suited for surveillance video, where multiple targets move individually or collectively. In such scenarios, a collage method aids compactness but conveys neither the flow of time nor the variation of motion. The most similar work to ours is [3], which dynamically reveals the motion of one or several actors. Unfortunately, it does not work well in the case of occluding motion, which occurs often in surveillance videos.
[Figure 1 diagram: the collage layer (visual cover and contents) dynamically zooms in and maps to the synopsis layer (visual data), which holds the synopsis video (abridged edition) and the original video (book content).]
Fig. 1: The Dynamic VideoBook: organize and summarize a visual book
We observe that considerable effort has been invested in summarization techniques, while very little attention has been paid to organizing the video content, which is quite useful in video summarization and navigation; e.g., the character–story organization of a movie helps a summary eliminate dispensable plots. Another observation is that a hierarchical collage incorporated with synopsis provides a better user interface. Based on these observations, we present a novel hierarchical summarization system called the Dynamic VideoBook. It assumes that a surveillance video can be organized as a visual book with one visual cover, visual contents and visual pages, so that the video is hierarchically structured and efficiently indexed by the contents. Under this assumption, the system consists of the two layers illustrated in Figure 1, of which the upper one is the collage layer that extracts visual contents and the lower one is the synopsis layer that summarizes visual data. We adopt our previous work [6] to perform online video synopsis summarizing general motions, during which foreground object tracks are extracted, recorded and selectively arranged in a more condensed way. Meanwhile, video collage can offer an explicit, comprehensible understanding of the video. However, the common collage technique of selecting regions of interest (ROIs) and then blending them together impairs motion integrity and ignores the spatiotemporal relations of surveillance videos. So we present a collage method that conveys motion well and structurally extracts visual contents. We also enable dynamic browsing by zooming into the collage in real time. Moreover, in order to efficiently search or index the visual book, we build mappings from the collage layer to the synopsis layer. As shown in Figure 1, the upper image in the collage layer is the overall collage image, treated as the visual cover, composed of three images boxed in blue that are regarded as the visual chapters. Users can review more information about each chapter by zooming in and then obtain the sub-chapters, as depicted in the lower image of the collage layer. By continuous browsing, the entire visual contents are gradually formed. For more details in the yellow rectangle, users can efficiently index into the synopsis layer for the corresponding synopsis sequences and original video sequences. With this hierarchical design, our system is not only effective in video navigation through dynamic video collage but also efficient in video searching and indexing through mapping to the synopsis or original video.

978-1-4799-2341-0/13/$31.00 ©2013 IEEE ICIP 2013
2. SYSTEM OVERVIEW
Given a surveillance video, we aim to establish a compact yet comprehensible video collage that supports zoom-in and zoom-out, a synopsis video, and mappings from the collage layer to the synopsis layer. Our synopsis process [6] summarizes the video content in real time. The collage process then begins, based on the object tracks extracted during synopsis.
Fig. 2: System overview (red rectangle: pROI; green rectangle: detailed patches)
To make a descriptive collage, and considering that simply blending small ROIs would generally ruin ensemble motions and spatiotemporal relations, we adopt the following mechanism: assemble larger, representative ROIs (pROIs) into several images that best describe the video plot, called plot images (red rectangle in Figure 1), and selectively stitch smaller ROIs, called detailed patches (yellow rectangle in Figure 1), around them in temporal order to generate a collage image. We believe this mechanism lets the plot images represent the general motions and the detailed patches supply detailed description.
The system framework is shown in Figure 2. We first segment the video into several volumes, then extract the pROIs of each volume and assemble them into plot images. The final collage is obtained by blending the plot images and selected detailed patches in a storyboard layout. Last but not least, we build the mappings. If a user requests to explore the collage more thoroughly, we iteratively apply the above procedure, treating the corresponding depicted volume as a new video, which enables dynamic video navigation.
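As an illustrative sketch only, not the authors' implementation, the framework above (segment, then recurse on zoom-in by treating the clicked volume as a new video) might be organized as follows; all function and field names are hypothetical, and the uniform segmentation stands in for the similarity-driven segmentation of Sec. 3.1:

```python
from dataclasses import dataclass

@dataclass
class Volume:
    """A temporal segment of the video, as a frame range [start, end)."""
    start: int
    end: int

def segment_video(num_frames: int, volume_len: int) -> list:
    """Uniform stand-in for the similarity-driven segmentation of Sec. 3.1."""
    return [Volume(s, min(s + volume_len, num_frames))
            for s in range(0, num_frames, volume_len)]

def build_collage(start: int, end: int, volume_len: int, depth: int = 0) -> dict:
    """Collage node for frames [start, end); zooming in treats a clicked
    volume as a new video and repeats the same procedure one level deeper."""
    volumes = segment_video(end - start, volume_len)
    node = {"span": (start, end), "volumes": volumes, "children": {}}
    if depth < 1 and volume_len > 1:  # recursion limited for the sketch
        for i, v in enumerate(volumes):
            node["children"][i] = build_collage(
                start + v.start, start + v.end,
                max(1, volume_len // 3), depth + 1)
    return node
```

Clicking plot image i of a node would then descend into `node["children"][i]`, mirroring the iterative procedure described above.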
3. DYNAMIC VIDEO COLLAGE
Given a video sequence V containing M frames {Ii}, i = 1, ..., M, and N foreground objects {Oj}, j = 1, ..., N, obtained from the synopsis process, each Oj contains a set of variables Oj = (f_s^j, f_e^j, Tj, Wj), where f_s^j and f_e^j are the start and end frame numbers of this object, Tj is a set of tracks and Wj is the set of corresponding weights of the tracks. We segment V into {Vk}, k = 1, ..., K, select pROIs {R} and detailed patches {P}, and compose them into a collage C. Let λ = ({Vk}, {R}, {P}) be one feasible solution; we formalize video collage as the process of minimizing E(λ):
E(λ) = ω1Eseg(λ) + ω2Erep(λ) + ω3Ecomp(λ) (1)
where Eseg(λ) denotes the segmentation cost, Erep(λ) denotes the representation cost caused by pROI and detailed-patch selection, and Ecomp(λ) denotes the composition cost of assembling pROIs and detailed patches; ω1, ω2 and ω3 are predefined to control the relative strength of each energy term. We solve this minimization problem in a step-wise manner.
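To make the weighted energy of Eq. (1) concrete, here is a minimal Python sketch, together with the overlap similarity that underlies the segmentation term of Sec. 3.1; the per-object weight dictionaries and all helper names are our own illustrative assumptions, not the paper's implementation:

```python
def similarity(weights_i, weights_j):
    """S(Vi, Vj): histogram intersection of per-object track-weight sums,
    where weights_i maps object id -> weight sum of its tracks in Vi."""
    objects = set(weights_i) | set(weights_j)
    return sum(min(weights_i.get(o, 0.0), weights_j.get(o, 0.0)) for o in objects)

def seg_cost(volume_weights):
    """Eseg: similarity of each volume to its right neighbor, normalized by
    the smaller self-similarity (cf. Eq. (2))."""
    cost = 0.0
    for wi, wj in zip(volume_weights, volume_weights[1:]):
        denom = min(similarity(wi, wi), similarity(wj, wj))
        cost += similarity(wi, wj) / denom if denom > 0 else 0.0
    return cost

def collage_energy(e_seg, e_rep, e_comp, w1=1.0, w2=1.0, w3=1.0):
    """E(λ) = ω1·Eseg(λ) + ω2·Erep(λ) + ω3·Ecomp(λ), Eq. (1)."""
    return w1 * e_seg + w2 * e_rep + w3 * e_comp
```

A good segmentation drives `seg_cost` down: volumes that share little object weight with their neighbors contribute near-zero terms.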
3.1. Segmentation cost Eseg(λ)
A fine segmentation should enlarge the within-volume similarity and decrease the between-volume similarity. We measure the similarity as S(Vi, Vj) = Σ_{l=1}^{N} min(w_l^i, w_l^j), where w_l^i is the weight sum of the tracks of object Ol that appear in Vi. In this way, we define Eseg(λ) as follows:

Eseg(λ) = Σ_{k=1}^{K} S(Vk, Vl) / min(S(Vk, Vk), S(Vl, Vl))    (2)

where Vl is the neighboring volume of Vk.

3.2. Representation cost Erep(λ)
The representation cost measures to what degree the selected pROIs and patches represent the volume. Taking the preservation of spatiotemporal relations, objects and motions into account, we define the cost as a combination of a term for each.
Erep(λ) = Σ_{k=1}^{K} (αA(Vk) + βB(Vk) + γD(Vk))    (3)
where A(Vk), B(Vk) and D(Vk) denote the spatiotemporal cost, object cost and motion cost, respectively, and α, β and γ measure their relative importance.
As the spatiotemporal relations and foreground objects are strongly emphasized, we construct candidate pROIs such that a region in a small volume contains foreground pixels and is at the same time distant from other regions, ensuring that small motions and clustered objects are preserved in the pROIs while undesired background is eliminated. In our implementation, for each frame we cluster the foreground regions of a neighboring small volume (e.g., 30 frames) into pROIs and the regions of a smaller volume (e.g., 10 frames) into detailed patches. Given candidate pROIs and detailed patches, A(Vk) is measured as the weighted sum of clustered moving objects missed by the selected pROIs, B(Vk) is calculated as the number of lost objects, and D(Vk) computes the weighted sum of lost tracks of each object. In each volume, we select a predefined number of pROIs and detailed patches. We test our approach on sequence S2L1 from the VS-PETS 2009 data set.¹
The three selected pROIs and the detailed patches of the first segmented volume are depicted in Figure 3 (a) and (b), respectively. It can easily be perceived that seven pedestrians are moving, of which two walk as a couple, three discuss for a while, and one in blue clothes walks through. In addition, given the pROIs, the detailed patches reveal further information.
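Under our own toy data model (per-object lists of track weights; not the paper's exact representation), one volume's term of Eq. (3) could be counted as:

```python
def rep_cost_volume(all_tracks, kept_tracks, alpha=1.0, beta=1.0, gamma=1.0):
    """One volume's term of Eq. (3).

    all_tracks / kept_tracks map object id -> list of track weights before
    and after pROI / detailed-patch selection.
      A(Vk): total weight of objects missed entirely (spatiotemporal cost proxy)
      B(Vk): number of objects lost completely
      D(Vk): weight of individual tracks lost from objects that survive
    """
    lost = set(all_tracks) - set(kept_tracks)
    A = sum(sum(all_tracks[o]) for o in lost)
    B = len(lost)
    D = sum(sum(all_tracks[o]) - sum(kept_tracks[o]) for o in kept_tracks)
    return alpha * A + beta * B + gamma * D
```

With unit weights, dropping an object entirely is penalized through both A and B, while trimming tracks from a kept object is penalized only through D.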
Fig. 3: (a) The extracted pROIs, (b) the extracted detailed patches, (c) the plot image obtained from (a) and (b)
3.3. Composition cost Ecomp(λ)
The composition cost penalizes weak composition quality of the pROIs {R} and detailed patches {P}. To make the collage comprehensible, it is critical that the composed plot image makes sense. Particularly for surveillance video, we expect general motions to be kept intact and groups to be well preserved. However, objects from different frames often occlude one another, which results in an unpleasant visual experience when stitching objects from each pROI at their spatial locations. In contrast, simply blending pROIs together avoids the occlusions but suffers from loose structure and implicit motion. Considering the above, we adopt the following mechanism: if two pROIs share a certain number of objects, we blend them; otherwise we simply concatenate them. If small regions intersect, we resize the plot image (foreground object areas from the pROIs are retained) using seam carving [7] and then selectively stitch objects with slight deviations; if large regions occlude, we concatenate the occluding areas using graph cuts. In this way, the composition cost of {R} is computed by Cr({R}) = L({R}) + Area(Ip)/Area({R}), where L({R}) denotes the weighted sum of lost objects and their deviations from the original positions, Area(·) denotes area size, and Ip denotes the plot image. Arranging {P} is quite simple: resizing and blending in temporal order. Consequently, the composition cost of {P} is computed by Cp({P}) = Σ (1 − Area(P̂)/Area(P)), where P̂ denotes the resized image of P.

¹http://www.cvg.rdg.ac.uk/PETS2009

So the overall composition cost can be described by the following function:
Ecomp(λ) = Σ_{k=1}^{K} (Cr({R}) + Cp({P}))    (4)
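A direct transcription of the two cost terms and Eq. (4), with our own function names and areas passed in as plain numbers (a sketch, not the authors' code):

```python
def cr(lost_term, plot_area, roi_area):
    """Cr({R}) = L({R}) + Area(Ip) / Area({R})."""
    return lost_term + plot_area / roi_area

def cp(orig_areas, resized_areas):
    """Cp({P}) = sum over patches of (1 - Area(P_hat) / Area(P))."""
    return sum(1.0 - r / a for a, r in zip(orig_areas, resized_areas))

def comp_cost(per_volume):
    """Ecomp(λ): Eq. (4), summing Cr + Cp over the K volumes.
    per_volume is a list of (cr_value, cp_value) pairs."""
    return sum(c_r + c_p for c_r, c_p in per_volume)
```

Note that `cp` is zero when no patch is shrunk, so the cost only punishes how much the detailed patches had to be resized.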
Since showing the temporal relation helps in understanding the flow of time and conveying the motions, we apply a falloff to each object that makes older objects more transparent and highlights the newer ones in the plot image, as illustrated in Figure 3 (c).
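The falloff can be realized as a simple alpha ramp over time; the linear shape and the `min_alpha` floor are our assumptions, as the paper does not specify the falloff curve:

```python
def falloff_alpha(t, t_old, t_new, min_alpha=0.3):
    """Opacity for an object occurring at frame t: the newest object (t_new)
    is fully opaque, the oldest (t_old) fades down to min_alpha."""
    if t_new == t_old:
        return 1.0
    frac = (t - t_old) / (t_new - t_old)
    return min_alpha + (1.0 - min_alpha) * frac
```

Each object's pixels would then be blended into the plot image with this alpha, so recency reads directly as opacity.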
3.4. Dynamic browsing

To better explore the video content, we enable dynamic browsing by clicking the plot images, which iteratively minimizes the energy and subsequently generates a new, more detailed collage image containing its own plot images and detailed patches. However, the collage must be shown and zoomed in or out in real time, which imposes cost limitations on the video processing. Given the energy definition, the first and second terms can be pre-computed, so the segmentation results as well as the selected pROIs and detailed patches can be stored. When a user requests to zoom in, the corresponding pROIs and detailed patches are composed appropriately with respect to the third term of the energy function. In our implementation, generating a new collage takes three seconds on average.
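The pre-computation strategy above can be sketched as a cache keyed by volume: segmentation and selection (the first two energy terms) are stored offline, and only composition (the third term) runs at click time. The class and method names are illustrative:

```python
class CollageCache:
    """Stores pre-computed selections so zoom-in only pays for composition."""

    def __init__(self):
        self._store = {}  # volume id -> (pROIs, detailed patches)

    def precompute(self, volume_id, prois, patches):
        """Offline step: record the selected pROIs and patches per volume."""
        self._store[volume_id] = (prois, patches)

    def zoom_in(self, volume_id, compose):
        """Online step: compose(pROIs, patches) -> collage image."""
        prois, patches = self._store[volume_id]
        return compose(prois, patches)
```

The split keeps the interactive path cheap: only the composition callback runs when the user clicks.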
4. DYNAMIC VIDEOBOOK
Video collage extracts succinct and explicit visual contents for visualization, while video synopsis condenses informative video content for quick browsing. However, neither of them works independently to let users search or index efficiently. So we establish a hybrid system with the dynamic collage as the upper layer and the original video together with the synopsis video as the lower layer. Building the mappings from the collage layer to the synopsis layer is just like linking a table of contents to the book's text, which obviously improves searching efficiency. In detail, we map each plot image and its detailed patches to a certain time span of the synopsis video as well as of the original video, and we also map each synopsis frame to a certain time span of the original video. In this way, if a user sees an interesting object in the collage image, it is much easier and quicker to check its sketchy motion in the synopsis video. For more details, the correlated original video sequence can be quickly indexed and browsed by mapping through either the synopsis or the collage.

Fig. 4: (a) Collage result using Wang et al.'s method, (b) our synopsis sequences, (c) our dynamic collage: the lower image is the collage of the first plot image from the upper collage
With this design, the upper layer of the Dynamic VideoBook takes charge of general dynamic interactive browsing, the lower layer summarizes the content, and the hierarchical design enables efficient searching and indexing. Users can also highlight or search for specific objects in the collage panel. As shown in Figure 1, the patches boxed in yellow can be quickly indexed to the synopsis and original sequences.
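The layer-to-layer mappings can be held in a simple index structure; the span-merging rule below (taking the min/max over the covered original spans, since synopsis frames are nonchronological) is our own illustrative choice:

```python
class LayerMap:
    """Hypothetical index: collage element -> synopsis span -> original span."""

    def __init__(self):
        self.collage_to_synopsis = {}   # element id -> (syn_start, syn_end)
        self.synopsis_to_original = {}  # synopsis frame -> (orig_start, orig_end)

    def link_collage(self, elem_id, syn_span):
        self.collage_to_synopsis[elem_id] = syn_span

    def link_synopsis(self, syn_frame, orig_span):
        self.synopsis_to_original[syn_frame] = orig_span

    def original_span(self, elem_id):
        """Index from a collage element through the synopsis down to the
        original video, merging the covered original-time spans."""
        s, e = self.collage_to_synopsis[elem_id]
        spans = [v for f, v in self.synopsis_to_original.items() if s <= f <= e]
        return (min(a for a, _ in spans), max(b for _, b in spans))
```

A click on a plot image thus resolves first to a synopsis span and, when needed, straight through to the original footage.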
5. EXPERIMENTS
We test our system on a public surveillance video and compare our collage method with Wang et al.'s method [1]. As illustrated in Figure 4, (a) gives the collage result using Wang et al.'s method, which selects ROIs and then blends them together. (b) shows three frames of the generated synopsis video, in which the objects are coherently and densely arranged. In (c), the upper image is the overall collage image produced by our method and the lower image is the dynamic collage depicting the zoomed-in area of the upper collage, in which the motions are depicted more explicitly. It is quite clear that the collage image produced by our method is more comprehensible than that of the traditional method [1], which we believe benefits from the plot image with motion well preserved and the spatiotemporal relations revealed. From our collage image we can easily perceive that people are walking in a certain area. In contrast, the traditional collage images confuse users by ignoring the spatiotemporal information. In addition, through this hierarchical system we can highlight objects such as the two boxed in yellow in the lower collage in (c); the corresponding pedestrians have been boxed in red throughout the collage and can be indexed to the relevant synopsis sequences.
6. CONCLUSIONS
In this paper, we present a hierarchical video summarization system designed particularly for surveillance video, the Dynamic VideoBook, which incorporates video collage that structurally extracts general visual contents and video synopsis that summarizes informative motions. Since commonly used video collage methods do not work well on surveillance video, we put forward a new dynamic video collage method that makes the result more comprehensible. In addition, the collage can be dynamically zoomed in to obtain details. Having the contents of the visual book and the visual data, including the synopsis video and the original video, we hierarchically build mappings from the contents to the visual data for efficient interactive searching and indexing.
7. ACKNOWLEDGEMENT
This work is supported by the National Science Foundation of China under grant No. 61075026, and by a grant from Omron Corporation.
8. REFERENCES
[1] T. Wang, T. Mei, X.-S. Hua, X. Liu, and H.-Q. Zhou, "Video collage: A novel presentation of video sequence," in ICME. IEEE, 2007.
[2] Y. Pritch, A. Rav-Acha, and S. Peleg, "Nonchronological video synopsis and indexing," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1971–1984, 2008.
[3] C. D. Correa and K. L. Ma, "Dynamic video narratives," ACM Trans. Graph., vol. 29, no. 4, 2010.
[4] C. Barnes, D. B. Goldman, E. Shechtman, and A. Finkelstein, "Video tapestries with continuous temporal zoom," in SIGGRAPH. ACM, 2010.
[5] T. Chen, A. D. Lu, and S. M. Hu, "Visual storylines: Semantic visualization of movie sequence," Computers & Graphics, vol. 36, pp. 241–249, 2012.
[6] L. Sun, J. Xing, H. Ai, and S. Lao, "A tracking based fast online complete video synopsis approach," in ICPR. IEEE, 2012.
[7] S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," in SIGGRAPH. ACM, 2007.