Broadcast Video Navigation Interface

Bruno do Nascimento Teixeira, Júlia Epischina Engrácia de Oliveira, Tiago Oliveira Cunha and Arnaldo de Albuquerque Araújo
Department of Computer Science
Universidade Federal de Minas Gerais
Belo Horizonte, Brazil
[email protected], [email protected], [email protected], [email protected]

Fillipe Dias Moreira de Souza
Department of Computer Science
University of South Florida
Tampa, United States
[email protected]

Abstract—This paper describes an interface for video navigation. It allows users to search broadcast news videos using different approaches and organizes video segments into a timeline so that users can visualize different aspects of the video. The navigation application consists of browsing, display, and editing interfaces and provides static and dynamic summarization. Video segmentation uses face detection based on the Viola-Jones algorithm and video classification into indoor/outdoor scenes based on SVM (Support Vector Machines). We aim to provide an effective tool for the exhibition of data, allowing a high degree of interaction.

Keywords-Multimedia user interface and interaction, static summarization, dynamic summarization, face detection, video segmentation, supervised learning, video browsing.

I. INTRODUCTION

Some works tackle the problem of making video operation easier for the end user. For instance, Forlines [1] describes a video presentation system prototype based on visual content. The proposed prototype renders the frames in display regions according to the content structure. This approach allows keeping the continuity of the story while enhancing the viewing experience.

The difficulty some users have navigating web application interfaces is pointed out by Komlodi et al. [2], specifically in search history systems. This raises the need to find solutions that can help these users with navigation, and the proposal was to identify design guidelines for these systems.

Regarding video search, Su et al. [3] presented a method for content-based video search that uses pattern-based indexing and combining techniques. The problem of the high dimensionality of the feature vectors can be solved with this pattern-based indexing, which also provides an efficient method to find the desired videos among a large amount of miscellaneous data.

The problem of efficient retrieval and indexing in video databases is approached by Morand et al. [4]. In this work, a method for retrieving scalable objects in high-resolution videos is proposed, using descriptors obtained from statistical distributions of wavelet transform coefficients.

II. METHODOLOGY

In this section, the specific methodologies for each proposed algorithm and for the interface development are presented.

A. Video Summarization

A video sequence normally contains a large number of frames. In order to ensure that humans do not perceive any discontinuity in the video stream, a frame rate of at least 25 fps is required. However, this volume of video data is a barrier to many practical applications, and therefore there is a strong demand for a mechanism that allows the user to gain certain perspectives of the video data without watching the entire video. This mechanism is named video summarization.

Video summarization corresponds to the process of extracting a summary of the content of a video, which aims to quickly provide concise information about the video content while preserving the original message. It complements the automatic video retrieval approach, especially when content-based indexing and retrieval of video sequences have limited success. Video summaries are presented as a solution to enable early identification of video content, avoiding the time spent on manual analysis. The summary allows users to judge whether the video is really relevant for the intended purpose, assisting them in making decisions [5].

There are two types of video abstracts: static and dynamic summaries. A static summary consists of a set of images that are considered the most representative of the video, while a dynamic summary is comprised of a set of video segments arranged in temporal order. In general, the latter is more attractive, since it incorporates elements of movement and audio.

In this work we have used a method designed to be simple and effective at generating static abstracts [6]. Color histograms and line profiles are used to represent the video images, which accounts for the feature extraction.

After visual feature extraction, the images are grouped by the k-means algorithm [7]. Each cluster is then represented by its most representative frame, producing a set of key frames. In some cases, different key frames with very similar visual content can be selected; in order to create the static video summary, the key frames are therefore filtered. With this filtering operation, the number of key frames is reduced while maintaining the quality of the results, and the key frames are arranged in temporal order to facilitate the comprehension of the result.
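As a rough illustration of this pipeline, the sketch below samples frames, describes each one with a normalized color histogram, clusters the descriptions with k-means, and keeps the frame closest to each cluster centre, in temporal order. It is a simplified stand-in for the method of [6]: line profiles and the key-frame filtering step are omitted, and OpenCV and scikit-learn are assumed as implementation libraries.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_key_frames(video_path, n_clusters=5, sample_rate=25):
    """Return key frames (index, image) of a video, in temporal order."""
    cap = cv2.VideoCapture(video_path)
    frames, features = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_rate == 0:  # roughly one sampled frame per second at 25 fps
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0], None, [16], [0, 180]).flatten()
            features.append(hist / (hist.sum() + 1e-9))
            frames.append((idx, frame))
        idx += 1
    cap.release()

    features = np.array(features)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    key_frames = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # the sampled frame closest to the cluster centre represents the cluster
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        key_frames.append(frames[members[np.argmin(dists)]])
    key_frames.sort(key=lambda p: p[0])  # temporal order eases comprehension
    return key_frames
```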

Aiming to produce dynamic summaries, one additional step was incorporated into the static summary approach.


At the end of the static summary generation process, shot boundary detection is performed on the video. The shots that contain images present in the static summary are selected to compose the dynamic summary, which is then presented as a short video.
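The paper does not detail its shot boundary detector; a common baseline, shown below purely as an assumption, declares a new shot whenever the color-histogram difference between consecutive frames exceeds a threshold.

```python
import cv2
import numpy as np

def detect_shot_boundaries(video_path, threshold=0.5):
    """Return frame indices at which new shots are assumed to start."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256]).flatten()
        hist = hist / (hist.sum() + 1e-9)
        # a large histogram change between consecutive frames suggests a cut
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```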

B. Segment Redundancy

To remove redundancy between video segments, a clustering algorithm is carried out. After the clustering process, similar segments, represented by visual-word histograms, belong to the same group. For representative segment selection, the distances between the histograms of a group and its centroid are calculated, using the Euclidean distance as the metric. Figure 1 illustrates the clustering process and the selection of the most representative segment.

Fig. 1. Clustering algorithm.
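A minimal sketch of the selection step described above, assuming the visual-word histograms of the segments and their cluster assignment (e.g. from k-means) are already available:

```python
import numpy as np

def closest_to_centroid(histograms, labels, centroids):
    """For each group, return the index of the segment nearest its centroid."""
    histograms, labels = np.asarray(histograms), np.asarray(labels)
    representatives = {}
    for g, centroid in enumerate(centroids):
        members = np.where(labels == g)[0]
        if len(members) == 0:
            continue
        # Euclidean distance between member histograms and the group centroid
        dists = np.linalg.norm(histograms[members] - centroid, axis=1)
        representatives[g] = int(members[np.argmin(dists)])
    return representatives
```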

C. Segment Selection

To generate dynamic content, the most significant segments must be selected. The selection relies on the assumption that the higher the level of activity in a segment, the more information it provides. In this work, as in [8], [9] and [10], we use motion measures to calculate the level of activity: the average magnitude of the motion vectors between the frames of a segment is used as the measure.
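The paper measures activity as the average magnitude of motion vectors between frames; the sketch below approximates this with dense Farneback optical flow from OpenCV, which is an assumption standing in for the authors' exact motion estimator.

```python
import cv2
import numpy as np

def segment_activity(frames):
    """Average motion magnitude over a segment, given its BGR frames in order."""
    magnitudes = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(float(mag.mean()))
        prev = gray
    return float(np.mean(magnitudes)) if magnitudes else 0.0
```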

To ensure that all summaries respect a predefined maximum size, the problem was modeled as the widely known binary knapsack problem [11]. Given a set of n objects and a knapsack, let:

• c_j = the benefit of object j,
• w_j = the weight of object j,
• b = the capacity of the knapsack.

We determine which objects should be placed in the knapsack so as to maximize the total benefit while the total weight does not exceed the capacity. Formally, the goal is to

maximize z = ∑_{j=1}^{n} c_j s_j, subject to ∑_{j=1}^{n} w_j s_j ≤ b, with s_j ∈ {0, 1}.

Summaries are built by the union of the segments with the highest motion values, and this union remains within the preset maximum summary size. Thus, the number of frames allotted to the summary plays the role of the knapsack capacity (each segment weighing its own number of frames) and the amount of motion plays the role of the benefit. An algorithm based on dynamic programming is used to solve the knapsack problem. Figure 2 shows the home tab with dynamic summaries.

Fig. 2. Home with dynamic summaries.
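A minimal sketch of the dynamic-programming solution to the 0/1 knapsack formulated above, taking segment lengths in frames as weights and motion values as benefits; the variable names are illustrative, not the paper's.

```python
def knapsack_select(weights, benefits, capacity):
    """Return indices of the segments selected by 0/1 knapsack DP."""
    n = len(weights)
    # table[j][w]: best benefit using the first j segments within w frames
    table = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for w in range(capacity + 1):
            table[j][w] = table[j - 1][w]                 # skip segment j
            if weights[j - 1] <= w:
                take = table[j - 1][w - weights[j - 1]] + benefits[j - 1]
                if take > table[j][w]:
                    table[j][w] = take                    # take segment j
    # backtrack to recover the selected segments (s_j = 1)
    selected, w = [], capacity
    for j in range(n, 0, -1):
        if table[j][w] != table[j - 1][w]:
            selected.append(j - 1)
            w -= weights[j - 1]
    return sorted(selected)
```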

D. Face Detection

Face detection is performed by the method of Viola-Jones [12], which is implemented in the OpenCV1 library. This method for detecting objects in images is based on four concepts: simple rectangular features called Haar features, an integral image for fast feature computation, the AdaBoost machine learning method [13], and a cascade classifier that combines the features efficiently.

This method rescales the detector instead of the input image and runs the detector many times over the image, each time with a different size. Viola and Jones devised a scale-invariant detector that requires the same number of calculations regardless of the size. This detector is constructed using a so-called integral image and simple rectangular Haar features. Weak classifiers are combined as a filter chain, which is especially efficient for classifying regions of an image. Each filter is a weak classifier consisting of one Haar feature. The threshold for each filter is set low enough that all examples of faces, or nearly all, are classified correctly. During the process, if any of these filters fails to accept an image region, the region is immediately classified as "non-face". When a region of the image passes through a filter, it goes to the next filter in the chain; image regions that pass through all filters in the chain are classified as faces.
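A minimal sketch of this step using the Viola-Jones cascade bundled with OpenCV; the stock frontal-face model and the detection parameters below are assumptions, not the paper's settings.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return bounding boxes (x, y, w, h) of faces found in a BGR frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # the detector is evaluated at several scales; regions rejected by an early
    # stage of the cascade are discarded without running the later stages
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                    minSize=(30, 30))
```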

E. Classification into Indoor/Outdoor Scenes

One feature of the application is the automated recognition of indoor and outdoor scenes in news broadcast videos. The goal of this task is to facilitate users' search for content of interest in voluminous video databases.

1http://sourceforge.net/projects/opencvlibrary


Fig. 3. Face navigation. Face detection is based on the method of Viola-Jones [12], available in the OpenCV library.

Separating indoor scenes from outdoor scenes can be seen as a preprocessing step of a hierarchy-based search system. To illustrate, it is easier to find highlight clips of Soccer World Cup matches if the search considers only the set of outdoor scenes rather than the whole dataset. That is, it suffices to work with only a selected sample of the available data.

Once the outdoor scenes are separated, a set of static images representing the outdoor scenes of the news broadcast video is displayed in an independent tab of the system. This tab is intended to provide users with an alternative means to easily look for the content of interest. The user would be able to visually verify whether any theme of interest is present in that particular news broadcast video (see Figure 4).

1) Descriptor: In our particular case, we denote as indoor scenes those where the anchor reads the news inside the studio, whereas the outdoor scenes are those where interviews and presentations take place in an outer setting. Naturally, color seems to be a distinguishing feature for telling indoor from outdoor environments, even though shape description still plays its role. For this reason, we have chosen to use a color-and-shape-based representation to characterize the video scenes. Van de Sande et al. [14] proposed several color-and-shape-based descriptors, which have performed well in object and scene categorization; one of them is HueSIFT. HueSIFT is a combination of the state-of-the-art local feature descriptor SIFT (for static images) with hue histograms. In this case, the neighborhood of each local feature is described in terms of hue (from the HSI color space), and its histogram is concatenated to the histograms of oriented gradients (shape description).
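The sketch below conveys the idea of a HueSIFT-style descriptor by concatenating, for each SIFT keypoint, the gradient descriptor with a hue histogram taken around the keypoint; it is an approximation of [14], not the original HueSIFT implementation, and the patch radius and bin count are assumptions.

```python
import cv2
import numpy as np

def huesift_like(image_bgr, hue_bins=36, radius=8):
    """Concatenate SIFT descriptors with local hue histograms (HueSIFT-like)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    hue = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)[..., 0]
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:
        return np.empty((0, 128 + hue_bins))
    combined = []
    for kp, desc in zip(keypoints, descriptors):
        x, y = int(kp.pt[0]), int(kp.pt[1])
        patch = hue[max(0, y - radius):y + radius, max(0, x - radius):x + radius]
        hist, _ = np.histogram(patch, bins=hue_bins, range=(0, 180))
        combined.append(np.concatenate([desc, hist / (hist.sum() + 1e-9)]))
    return np.array(combined)
```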

2) Bag-of-Visual-Words Representation: Given a video clip, a higher-level representation of its set of local features can be provided by the bag-of-features approach.

In this approach, a visual codebook is constructed by applying a clustering algorithm (such as k-means [7]) over a sample of the training dataset, and each cluster accounts for a visual word. The local features of an image are then assigned to the closest visual word (we used the Euclidean distance function). As a result, a histogram of visual-word counts is built to represent the set of local features describing a certain image.
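A minimal sketch of this representation, with k-means from scikit-learn assumed for the codebook construction; the codebook size is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_descriptor_sets, n_words=500):
    """Cluster a sample of training descriptors; each cluster is a visual word."""
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(training_descriptor_sets))

def bovw_histogram(image_descriptors, codebook):
    """Histogram of visual-word counts for one image (nearest word, Euclidean)."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)
```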

3) Classification and Performance: We have used Support Vector Machines (SVM) [15] to classify the videos. Since this is a binary problem, we decided to use a linear kernel, which showed good results. The LibSVM [16] implementation, which has been successfully applied to many applications in the literature, was used in the experiments.
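A minimal sketch of the classification step; scikit-learn's SVC, which wraps LibSVM [16], stands in for the authors' setup, with a linear kernel as stated in the paper. The indoor/outdoor label coding is an assumption.

```python
from sklearn.svm import SVC

def train_scene_classifier(train_histograms, train_labels):
    """Train a linear-kernel SVM on bag-of-visual-words histograms."""
    clf = SVC(kernel="linear")   # labels: 0 = indoor, 1 = outdoor (assumed coding)
    clf.fit(train_histograms, train_labels)
    return clf

# usage sketch: predictions = train_scene_classifier(X_train, y_train).predict(X_test)
```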

To validate the scene classification method, we tested the performance through a 5-fold cross-validation approach. In this scheme, a sample of frames is extracted from each video shot of both types. The whole set is uniformly split into 5 folds such that each fold is balanced in terms of the number of shots per class. Each fold is then tested with the union of the remaining folds as the training set; that is, if fold0 is the test set, then the training set is the concatenation of fold1, fold2, fold3, and fold4. Table I presents the performance evaluation for each fold in terms of frames, while Table II shows the performance evaluation in terms of shots, which is our final target, as we want to classify shots.

In order to compute the performance in terms of shots, we first count the number of indoor and outdoor frames of a shot and assign the class label according to a majority voting scheme. If the shot is correctly labeled, we increment the hit score for that particular class. The overall performance is given in Table III, which demonstrates that the employed method is effective for the intended application context.
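A minimal sketch of the shot-level evaluation described above: frame predictions within a shot are aggregated by majority vote and compared against the shot's label; the data structures are illustrative.

```python
from collections import Counter

def shot_accuracy(frame_predictions_per_shot, true_shot_labels):
    """Per-class shot accuracy under majority voting of frame predictions."""
    hits, totals = Counter(), Counter(true_shot_labels)
    for preds, true_label in zip(frame_predictions_per_shot, true_shot_labels):
        voted = Counter(preds).most_common(1)[0][0]  # majority vote per shot
        if voted == true_label:
            hits[true_label] += 1
    return {c: hits[c] / totals[c] for c in totals}
```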


Fig. 5. Shot navigation. Timeline (bottom) allows users to drag shots of interest.

TABLE I
CLASSIFICATION PERFORMANCE IN TERMS OF THE SAMPLING OF FRAMES. #inframe AND #outframe DENOTE THE NUMBERS OF INDOOR AND OUTDOOR SCENE FRAMES, RESPECTIVELY; inclass AND outclass CORRESPOND TO THE CLASSIFICATION PERFORMANCES OF THE INDOOR AND OUTDOOR FRAME SETS, RESPECTIVELY.

fold    #inframe  #outframe  inclass  outclass
fold0   416       281        88.94%   84.34%
fold1   421       239        96.67%   92.05%
fold2   523       231        86.81%   83.12%
fold3   401       179        94.76%   89.38%
fold4   445       252        73.03%   85.71%

TABLE II
PERFORMANCE OF THE SHOT CLASSIFICATION. #inshot AND #outshot DENOTE THE NUMBERS OF INDOOR AND OUTDOOR SHOTS, RESPECTIVELY; inclass AND outclass CORRESPOND TO THE CLASSIFICATION PERFORMANCE OF THE INDOOR AND OUTDOOR SHOT SETS, RESPECTIVELY.

fold    #inshot  #outshot  inclass  outclass
fold0   46       40        86.96%   92.5%
fold1   46       40        97.83%   92.5%
fold2   46       40        89.13%   87.5%
fold3   46       40        91.3%    90%
fold4   45       40        73.33%   90%

III. INTERFACE

The interface consists of three main areas: browsing, display, and editing, which help the user both to search for a specific subject inside a video and to organize the results effectively into a timeline. The home screen displays dynamic summaries of the whole dataset (see Figure 2).

The browsing area presents the results of the algorithms through images.

TABLE III
CONFUSION MATRIX.

          outdoor  indoor
outdoor   90%      10%
indoor    11.11%   88.89%

These images can be clicked for exhibition in the player, and they can also be dragged to the editing area. The display area contains a player with simple playback buttons, which can be resized without changing the video's aspect ratio.

Finally, the editing area presents a timeline, initially empty, that can be filled by dragging the available segments from the browsing area. When the play button is clicked, the segments are played sequentially and without gaps. The segments can also be easily rearranged on the timeline, forming a new edit of the video.

Besides these areas, there is a top menu where the user can register, choose a new video to watch, or submit a new video to be processed.

The interface is a layer completely isolated from the video processing module; both interact only through a database. The database used is MySQL2, and the data are manipulated by PHP, a language particularly suited for Web development, which composes the server side of the application. The client side is Flash based, using the MXML and ActionScript 3 languages.

Figure 6 shows a high-level overview of our content-based video analysis prototype. We implemented a Web-based system, since this is the most effective way to allow real users to interact with the system.

2http://www.mysql.com/


Fig. 4. The web interface provides a segmentation based on Indoor (a) / Outdoor (b) scenes. The blue scenario characterizes indoor scenes.

The internal architecture follows the most common models of communication between the modules of a Web application.

Fig. 6. Operation diagram of the proposed system.

Basically, a video submitted by a user is processed using the algorithms available in the system. After that, users can navigate through this particular video, or through another one from the system database, using the proposed navigation mechanism. Figures 5 and 3 illustrate the visual web interface for video news browsing, which contains the Flash player, the timeline (which allows users to drag shots of interest), and the video frames representing the shots.

IV. CONCLUSION

We have proposed a video navigation interface to improve the effectiveness of information access in broadcast videos. It consists of static and dynamic summarization, face detection, and classification of indoor and outdoor scenes. Face detection is performed by the Viola-Jones method. Indoor/outdoor classification based on SVM proves effective for the intended application context; it uses HueSIFT as the local feature descriptor due to the blue pattern present in indoor scenes. The static and dynamic summarization algorithms use a combination of line profiles and color histograms to represent frames, together with the k-means clustering algorithm.

We are interested in several directions for future work. We seek an approach for scene and video indexing using speech and visual content, and event recognition based on multi-modality analysis.

ACKNOWLEDGMENT

The authors would like to thank CNPq, CAPES and FAPEMIG for supporting this work.

REFERENCES

[1] C. Forlines, “Content aware video presentation on high-resolution displays,” in Proceedings of the Working Conference on Advanced Visual Interfaces, ser. AVI ’08. New York, NY, USA: ACM, 2008, pp. 57–64. [Online]. Available: http://doi.acm.org/10.1145/1385569.1385581

[2] A. Komlodi, G. Marchionini, and D. Soergel, “Search history support for finding and using information: User interface design recommendations from a user study,” Information Processing & Management, vol. 43, no. 1, pp. 10–29, Jan. 2007. [Online]. Available: http://dx.doi.org/10.1016/j.ipm.2006.05.017

[3] J.-H. Su, Y.-T. Huang, H.-H. Yeh, and V. S. Tseng, “Effective content-based video retrieval using pattern-indexing and matching techniques,” Expert Syst. Appl., vol. 37, no. 7, pp. 5068–5085, Jul. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2009.12.003

[4] C. Morand, J. Benois-Pineau, J. P. Domenger, J. Zepeda, E. Kijak, and C. Guillemot, “Scalable object-based video retrieval in HD video databases,” Image Commun., vol. 25, no. 6, pp. 450–465, Jul. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.image.2010.04.004

[5] S. Uchihashi, J. Foote, A. Girgensohn, and J. Boreczky, “Video manga: Generating semantically meaningful video summaries.” ACM Press, 1999, pp. 383–392.

[6] S. E. F. de Avila, A. d. Jr., A. de A. Araújo, and M. Cord, “VSUMM: An approach for automatic video summarization and quantitative evaluation,” in Proceedings of the 2008 XXI Brazilian Symposium on Computer Graphics and Image Processing, ser. SIBGRAPI ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 103–110. [Online]. Available: http://dx.doi.org/10.1109/SIBGRAPI.2008.31

[7] J. B. MacQueen, “Some methods of classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.

[8] C.-M. Pan, Y.-Y. Chuang, and W. H. Hsu, “NTU TRECVID-2007 fast rushes summarization system,” in Proceedings of the International Workshop on TRECVID Video Summarization, ser. TVS ’07. New York, NY, USA: ACM, 2007, pp. 74–78. [Online]. Available: http://doi.acm.org/10.1145/1290031.1290045

[9] R. Laganière, R. Bacco, A. Hocevar, P. Lambert, G. Païs, and B. E. Ionescu, “Video summarization from spatio-temporal features,” in Proceedings of the 2nd ACM TRECVid Video Summarization Workshop, ser. TVS ’08. New York, NY, USA: ACM, 2008, pp. 144–148. [Online]. Available: http://doi.acm.org/10.1145/1463563.1463590

[10] N. Putpuek, D.-D. Le, N. Cooharojananone, S. Satoh, and C. Lursinsap, “Rushes summarization using different redundancy elimination approaches,” in Proceedings of the 2nd ACM TRECVid Video Summarization Workshop, ser. TVS ’08. New York, NY, USA: ACM, 2008, pp. 100–104. [Online]. Available: http://doi.acm.org/10.1145/1463563.1463581


[11] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms, 2nd ed. McGraw-Hill Higher Education, 2001.

[12] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1. Los Alamitos, CA, USA: IEEE Computer Society, Apr. 2001, pp. 511–518. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2001.990517

[13] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in Proceedings of the Second European Conference on Computational Learning Theory, ser. EuroCOLT ’95. London, UK: Springer-Verlag, 1995, pp. 23–37. [Online]. Available: http://dl.acm.org/citation.cfm?id=646943.712093

[14] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek, “Evaluation of color descriptors for object and scene recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, 2008. [Online]. Available: http://staff.science.uva.nl/~ksande/pub/vandesande-cvpr2008.pdf

[15] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc., 1995.

[16] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, May 2011. [Online]. Available: http://doi.acm.org/10.1145/1961189.1961199