Content-Based Multimedia Indexing and Retrieval

Semantic Annotation of Sports Videos

Jürgen Assfalg, Marco Bertini, Carlo Colombo, and Alberto Del Bimbo
University of Florence, Italy

Taking into consideration sports videos' unique qualities, we propose a system that semantically annotates them at different layers of semantic significance, using different elements of visual content. We decompose each shot into its visual and graphic content elements and, by combining several different low-level visual primitives, capture semantic content at a higher level of significance.

The quantity of videos generated by digital technologies has increased the need for automatically annotating video content and for techniques that support retrieving that content. Content-based video annotation and retrieval has therefore become an active research topic. Although we can successfully apply many of the results in content-based image retrieval to videos, additional techniques are necessary to address videos' unique qualities. For example, videos add the temporal dimension, which requires modeling object dynamics. Furthermore, although people often think of a video as just a sequence of images, it's actually a compound medium, integrating diverse media such as realistic images, graphics, text, and audio.1 Also, application contexts for videos are different from those for images and therefore call for different approaches to help users annotate, query, and exploit archived video data.

The huge amount of data that video streams deliver necessarily calls for higher levels of abstraction when we annotate content. This therefore requires us to investigate and model video semantics. Because of the type and volume of data, general-purpose approaches are likely to fail, since semantics inherently depend on a specific application context. Many researchers have addressed semantic modeling of content in multimedia databases. Researchers have also reported on concrete video retrieval applications by high-level semantics in specific contexts such as movies, news, and commercials.2,3

Due to their enormous commercial appeal, sports videos represent an important application domain. However, most research efforts so far have been devoted to characterizing single, specific sports. (For example, Miyamori and Iisaku4 proposed a method for annotating videos according to human behavior; Ariki and Sugiyama5 proposed a method for classifying TV sports news videos using discrete cosine transform [DCT] features; and Zhou et al.6 classified nine basketball events using color features, edges, and MPEG motion vectors.)

We propose an approach for semantic annotation of sports videos that include several different sports and even nonsports content. We automatically annotate videos according to elements of visual content at different layers of semantic significance. In fact, we primarily distinguish studio and interview shots from sports action shots and then further decompose the sports videos into their main visual and graphic content elements, including sport type, foreground versus background, text captions, and so on. We extract relevant semantic elements from videos by combining several low-level visual primitives such as image edges, corners, segments, curves, and color histograms, according to context-specific aggregation rules.

In this article, we illustrate three modules of our system, which performs semantic annotation of sports videos at different layers of semantic significance, using different elements of visual content.

The application context

The actual architecture of a system supporting video annotation and retrieval depends on the application context and, in particular, on end users and their tasks. Although all application contexts demand a reliable annotation of the video stream to effectively support selection of relevant video segments, it's evident that, for instance, service providers (such as broadcasters and editors) and consumers accessing a video-on-demand service have different needs.7

For both the old and new media, automatic annotation of video material opens the way for economically exploiting valuable assets. In particular, in the specific context of sports videos, two logging approaches exist, which let broadcasting companies reuse recorded material:

❚ posterity logging, where librarians add detailed and standardized annotation to archived material, and

❚ production logging, where (assistant) producers annotate live feeds or footage recorded a few hours before to edit a program that will be broadcast within a short time frame.

An example of reusing footage logged for posterity is selecting the best footage of a famous athlete to provide a historical context for a recent event. An example of production logging is selecting highlights, such as soccer goals or tennis-match points, to produce programs that contain one day's best sports actions.

In both scenarios, we should be able to automatically annotate video material, which is typically captured live, because detailed manual annotation is mostly impractical. The level of annotation should enable simple text-based queries. The annotation process includes such activities as segmenting the material into shots, grouping and classifying shots into semantic categories (such as type of sport), and supporting queries and retrieval of events that are significant to the particular sport. To achieve an effective annotation, we should have a clear insight into the current practice and established standards in the domain of professional sports videos, particularly concerning the nature and structure of their content.

The videos that comprise the data set we used for the experiments in this article include a variety of typologies. We drew most of the videos from the BBC Sports Library, which in turn collected them from other BBC departments and other broadcasters. The data set comprises more than 15 video tapes, each lasting from 30 to 120 minutes. The videos were mostly captured from digital video tapes and, in some cases, from S-VHS tapes. Digital video is the lowest acceptable standard for broadcasting professionals, and it provides digital quality at full PAL resolution.

Many of the videos in this sample collection are from the 1992 Barcelona Olympics, while some contain soccer games from other events. Thus, we used various types of sports to perform our experiments. The videos differ from each other in terms of types of sports (outdoor and indoor sports) and the number of athletes (single player or teams). Also, the videos differ in terms of editing—some are live feeds from a single camera for a complete event, some include different feeds of one event edited into a single stream, and others only feature highlights of minor sports assembled in a summary. We weren't able to make assumptions on the presence of a spoken commentary or superimposed text because that depends on a number of factors, including the technical facilities available on location and the agreements between the hosting broadcaster and other broadcasters. Figure 1, which includes sport sequences interwoven with studio scenes, shows the typical structure of a sports video; some videos might include superimposed graphics (such as captions or logos).

Figure 1. Typical sequence of shots in a sports video: studio/interview, graphic objects, playing field, audience, and player shots.

The computational approach

We organized the annotation task into three distinct subtasks:

1. preclassify shots (to extract the actual sports actions from the video stream),

2. classify graphic features (which, in sports videos, are mainly text captions that aren't synchronized with shot changes), and

3. classify visual shot features.

Next, we'll expound on the contextual analysis of the application domain. We analyze the specificity of the data and provide an overview of the rationale underlying how we selected relevant features.

Preclassifying sports shots

Our anchorman/interview shot classification module provides a simple preliminary classification of shot content, which subsequent modules can also exploit and enhance. This type of classification is necessary because some video feeds contain interviews and studio scenes featuring an anchorman and athletes. One example of this is the Olympic Games, where the material to be logged is often pre-edited by the hosting broadcaster. The purpose of this module is to roughly separate shots that contain possible sport scenes from shots that do not. To this end, we can follow a statistical approach to analyze visual content similarity and motion features of the anchorman shots, without requiring us to use any predefined shot content model as a reference. This independence from a shot content model lets the system correctly manage interviews, which don't feature a standard studio set-up, because athletes are usually interviewed near the playing field and each interview has a different background and location. Detecting studio scenes also requires such independence because each program's style is unique and often changes; otherwise, we would have to create and maintain a database of shot content models.

Studio scenes have a well-defined syntax: shot location is consistent within the video, the number of cameras and their fields of view are limited, and the sequence of shot content is often a repeating pattern. An example of such a structure is in Figure 2, which displays the first key frames of shots comprising a studio scene.

Figure 2. Studio scene with alternating anchorman shots.

Identifying graphic features

In sports videos, graphic objects may appear anywhere within the frame, even if most of the time they're in the lower third or quarter of the image. Also, the vertical and horizontal ratio of the graphic object zones varies—for example, a team roster might occupy a vertical box and one athlete's name might occupy a horizontal box (see Figure 3). For text graphics, character fonts can vary in size and typeface and may be superimposed either on an opaque background or directly on the image captured by the camera. Graphic objects often appear and disappear gradually, with dissolve or fade effects. These properties call for automatic graphic-object localization algorithms with the least amount of heuristics and possibly no training.

Past research has used several features such as edges and textures as cues of superimposed graphic objects.8,9 Such features represent global image properties and require the analysis of large frame patches. Moreover, natural objects such as woods and leaves or man-made objects such as buildings and cars can present a local combination of such features that the algorithms might wrongly classify as graphic objects.10

To reduce visual information to a minimum and preserve local saliency, we've elected to work with image corners, extracted from the images' luminance information. This is appealing for the purpose of graphic-object detection and localization because it prevents many misclassification problems arising with color-based approaches. This is particularly important when considering the characteristics of TV standards, which enforce a spatial subsampling of the chromatic information, causing the borders of captions to be affected by color aliasing. Therefore, to enhance the readability of characters, producers typically exploit luminance contrast, because luminance isn't spatially subsampled and human vision is more sensitive to it than to color contrast. Another distinguishing feature of our approach is that it doesn't require any knowledge of or training on superimposed caption features.

Classifying visual shot features

Generic sports videos often feature numerous different scene types intertwined with each other in a live video feed reporting on a single event or edited into a segment summarizing highlights of different events. A preliminary analysis of our data set—covering more than 10 hours of sports events—revealed that three types of scenes are most prevalent: the playing field, player, and audience (see Figure 4). Most of the action of a sport game takes place on the playing field—hence the relevance of playing field scenes, showing mutual interactions among subjects (players, referees, and so on) and objects (ball, goal, hurdles, and so on). However, along with playing field scenes, a number of scenes often appear, such as player close-ups and audience shots. The former typically show a player who had a relevant role in the most recent action—for example, the athlete who just failed throwing the javelin or the player who shot a penalty. Audience shots occur at the beginning and at the end of an event, when nothing is happening on the playing field, or just after a highlight. For example, in soccer, when a player shoots a goal, an audience shot showing the crowd's reaction is common.

In a sample of 1,267 key frames extracted randomly from our data set, approximately 9 percent were audience scenes. Player scenes represented up to 28 percent of the video material. To address the specificity of such a variety of content types, we devised a hierarchical classification scheme. The first stage performs a classification in terms of the categories of playing field, player, and audience, with a twofold aim. On the one hand, this provides an annotation of video material that is meaningful for users' tasks. On the other hand, it's instrumental for further processing, such as identifying sports type and detecting highlights.
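As a minimal sketch of how the two classification stages fit together, consider the fragment below. The classifier callables and the label strings are assumptions introduced for illustration, not the article's actual interfaces.

```python
def annotate_key_frame(frame, scene_classifier, sport_classifier):
    """Hierarchical annotation sketch: classify the scene type first, then
    identify the sport only for playing-field frames."""
    scene = scene_classifier(frame)  # expected: 'playing field', 'player', or 'audience'
    annotation = {"scene": scene}
    if scene == "playing field":
        # Sports-type identification runs only downstream of playing-field detection.
        annotation["sport"] = sport_classifier(frame)
    return annotation
```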

Inspecting the video material revealed that

❚ playing field shots typically feature large, homogeneous color regions and distinct long lines;

❚ in player shots, the player's shape appears distinctly in the foreground and the image's background tends to be homogeneous or blurred (either because of camera motion or lens effects); and

❚ in audience shots, individuals in the audience aren't always clear but the audience as a whole appears as a texture.

These observations suggest that basic edge and shape features could significantly help differentiate between playing field, player, and audience scenes. It's also worth pointing out that models for these classes don't vary significantly across different sports, events, and sources.

We propose that identifying the type of sport in a shot relies on playing field detection. In fact, we can observe that

❚ most sports events take place on a playing field, with each sport having its own playing field;

❚ each playing field has several distinguishing features, the most relevant of which is color; and

❚ the playing field appears in many frames of a video shot and often covers a large part of the camera frame—that is, a large area of the single images comprising the video.

Hence, the playing field and the objects that populate it can effectively support identification of sports types. Therefore, in our approach, we apply sports type identification to playing field shots output by the previous classification stage.
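To make the role of playing field color concrete, the sketch below maps the dominant hue of a frame already labeled as a playing field to a small set of candidate sports. The hue ranges and the sport list are illustrative assumptions only; the article's actual identification stage relies on trained classifiers over richer color histograms rather than hand-written hue rules.

```python
import cv2
import numpy as np

# Illustrative hue ranges (OpenCV hue spans 0-179); these values are assumptions,
# not taken from the article.
FIELD_HUE_RANGES = {
    "soccer (green field)": (35, 85),
    "swimming (blue pool)": (90, 130),
}

def guess_sport_from_field(frame_bgr):
    """Guess a sport from the dominant hue of a playing-field frame (sketch only)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hue_hist = cv2.calcHist([hsv], [0], None, [180], [0, 180]).ravel()
    dominant_hue = int(hue_hist.argmax())
    for sport, (lo, hi) in FIELD_HUE_RANGES.items():
        if lo <= dominant_hue <= hi:
            return sport
    return "unknown"
```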

Implementation

In this section, we describe the implementations of the modules supporting each subtask. We show how to compute the features and how we implemented the feature combination rules.

Figure 3. Examples of superimposed graphic objects. (a) A team roster might occupy a vertical box or (b) one athlete's name might occupy a horizontal box.

Figure 4. Although sports event footage can vary significantly, it shares distinguishing features. For example, (a) playing field lines are explicitly present in some outdoor and indoor sports (such as soccer and swimming) but can also be extended to other sports (such as cycling on public roads). Similarly, (b) player and (c) audience scenes appear in most sports videos.

Preclassifying sports shots

Studio and interview shots repeat at intervals of variable length throughout a video sequence. The first step for classifying these shots stems from this assumption and is based on the computation, for each video shot Sk, of a shot's lifetime L(Sk). The shot lifetime measures the shortest temporal interval that includes all the occurrences of shots with similar visual content within the video. Given a generic shot Sk, we compute its lifetime by considering the set Tk = {ti | σ(Sk, Si) < τs}, where σ(Sk, Si) is a similarity measure applied to key frames of shots Sk and Si, τs is a similarity threshold, and ti is the value of the time variable corresponding to the occurrence of the key frame of shot Si. We define the lifetime of shot Sk as L(Sk) = max(Tk) – min(Tk). We base shot classification on fitting the values of L(Sk) for all the video shots to a bimodal distribution. This lets us determine a threshold value tl that we use to classify shots into the sport and studio/interview categories. In particular, we classify as studio/interview shots all shots Sk for which L(Sk) > tl, where tl was determined according to the statistics of the test database and set to 5 seconds. We classify the remaining shots as sport shots.
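A rough sketch of this step follows. It assumes the shot segmentation, key-frame extraction, and similarity measure σ are available elsewhere; the parameter names and default thresholds are illustrative.

```python
def classify_by_lifetime(key_frame_times, key_frames, sigma, tau_s=0.3, t_l=5.0):
    """Label each shot via its lifetime L(Sk) = max(Tk) - min(Tk).

    sigma(a, b) is the key-frame similarity score compared against the threshold
    tau_s, as in the article's definition of Tk; t_l is the lifetime threshold
    (5 seconds in the article). Both defaults here are assumptions.
    """
    labels = []
    for k, fk in enumerate(key_frames):
        # Tk: occurrence times of shots whose key frames match shot k.
        # (The article additionally limits this search to a window of six shots.)
        t_k = [key_frame_times[i] for i, fi in enumerate(key_frames)
               if sigma(fk, fi) < tau_s]
        lifetime = max(t_k) - min(t_k)  # Tk always contains shot k itself
        labels.append("studio/interview" if lifetime > t_l else "sport")
    return labels
```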

Typical videos in the target domain don't contain complete studio shows, and in feeds produced on location, interviews have a limited time and shot length. This lets us reduce false detections caused by the repetition of similar sport scenes (for example, in the case of edited programs or summaries) by limiting the search for similar shots to a window of shots whose width we set experimentally to six shots. The adopted similarity metric is a histogram intersection of the mean color histograms of shots (H and S components of the HSI color space). Using the mean histogram takes into account the dynamics of sport scenes. In fact, even if some scenes take place in the same location, and thus the color histograms of their first frames may be similar, the following actions yield different color histograms. When applied to studio and interview shots, where the dynamics of the scene's lighting changes are much more compressed and the reduced camera and object movement doesn't introduce new objects, we get a stable histogram.
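A minimal sketch of this similarity measure is shown below, assuming OpenCV frames and using the HSV color space as a stand-in for the HSI space mentioned in the article; bin counts are assumptions.

```python
import cv2
import numpy as np

def mean_hs_histogram(frames, h_bins=64, s_bins=3):
    """Mean H-S histogram of a shot (HSV used as a stand-in for HSI)."""
    acc = np.zeros((h_bins, s_bins), dtype=np.float64)
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])
        acc += hist / hist.sum()  # normalize each frame's histogram before averaging
    return acc / len(frames)

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return float(np.minimum(h1, h2).sum())
```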

Although the mean color histogram accounts for minor variations due to camera and object movement, it doesn't account for spatial information. We refined the results from the first classification step by considering motion features of the studio/interview shots. This builds on the assumption that in an anchorman shot both the camera and the anchorman are almost steady. In contrast, for sport shots, background objects and camera movements—people, free-hand shots, camera panning and zooming, changes in scene lighting—cause relevant motion components throughout the shot. We performed classification refinement by computing an index of the quantity of motion QM for each possible anchorman shot. The algorithm for the analysis of this index considers the frame-to-frame difference between the shot key frame f1 and subsequent frames fi in the shot according to a pixel-wise comparison. To enhance sensitivity to motion, the algorithm subsamples the shot in time and compares the frames to the first key frame f1. Only those shots whose QM doesn't exceed a threshold τM are definitely classified as studio/interview shots.
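The motion refinement might be sketched as follows. Defining QM as the average fraction of changed pixels, along with the temporal subsampling step and the pixel-difference threshold, are assumptions made for illustration.

```python
import cv2
import numpy as np

def quantity_of_motion(frames, step=8, diff_threshold=25):
    """QM sketch: average fraction of pixels that differ from the first key frame,
    measured on temporally subsampled frames (step and threshold are assumptions)."""
    f1 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY).astype(np.int16)
    changed_fractions = []
    for frame in frames[step::step]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.int16)
        changed_fractions.append(float(np.mean(np.abs(gray - f1) > diff_threshold)))
    return float(np.mean(changed_fractions)) if changed_fractions else 0.0
```

Candidate anchorman shots whose QM stays below the threshold τM keep the studio/interview label; the rest are returned to the sport class.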

We used a subset of three videos to test the algorithm. There were 28 shots comprising short interviews of athletes and studio scenes. The algorithm identified 31 studio/interview shots, with five false detections and two missed detections. Using the movement feature reduced the number of false detections from five to three. The remaining false detections were due to slow-motion replays.

Identifying graphic features

We extracted the salient points of the frames, which we analyze in the following steps, using the Harris algorithm, from the luminance map of each frame. Corner extraction greatly reduces the amount of spatial data to be processed by the graphic-object detection and localization system. The most basic property of graphic objects is that they must remain stable for a certain amount of time so people can read and understand them.


We used this property in the first step of graphic-object detection. We checked each corner to determine if it's still in the same position for at least two more frames within a sliding window of four frames.

We marked each corner that complied with this property as persistent and kept it for further analysis. We then discarded all the others. Every eighth frame is processed to extract its corners, thus further reducing the computational resources needed to process a whole video. This choice is based on the assumption that, in order for the viewer to perceive and understand it, a graphic object must be stable on the screen for 1 second. We inspected the patch surrounding each corner (20 × 14 pixels), and in the absence of enough neighboring corners (corners whose patches don't intersect), we didn't consider the corner in further processing. This process, which we repeated in order to eliminate corners that got isolated after the first processing, assured that isolated high-contrast background objects contained within static scenes weren't recognized as possible graphic-object zones.
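A sketch of the persistence test is given below, assuming the frames are already temporally subsampled (one in every eight) and using OpenCV's Harris-based goodFeaturesToTrack as the corner extractor; the detector parameters and the position tolerance are assumptions.

```python
import cv2

def persistent_corners(sampled_frames, max_corners=400, quality=0.01, min_dist=5, tol=1.0):
    """Keep corners that reappear at (nearly) the same position in at least two of
    the next three sampled frames (a sliding window of four frames)."""
    corner_sets = []
    for frame in sampled_frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts = cv2.goodFeaturesToTrack(gray, max_corners, quality, min_dist,
                                      useHarrisDetector=True)
        corner_sets.append([] if pts is None else [tuple(p) for p in pts.reshape(-1, 2)])

    persistent = []
    for i in range(len(corner_sets) - 3):
        window = corner_sets[i + 1:i + 4]
        for (x, y) in corner_sets[i]:
            hits = sum(any(abs(x - u) <= tol and abs(y - v) <= tol for (u, v) in s)
                       for s in window)
            if hits >= 2:  # stable for at least two more frames
                persistent.append((i, float(x), float(y)))
    return persistent
```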

An unsupervised clustering is performed on the corners that comply with the temporal and spatial features we described earlier. This is aimed at determining bounding boxes for graphic objects (see Figures 5 and 6). For each bounding box, we calculate the percentage of pixels that belong to the corner patches, and if it's below a predefined threshold, we discard the corners. This strategy reduces the noise due to high-contrast backgrounds during static scenes, which typically produce small scattered zones of corners that the spatial feature analysis can't eliminate.
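One simple way to realize the clustering and density check is sketched below, using corner-patch masks and connected components in place of the unspecified clustering algorithm; the patch size follows the text, while the density threshold is an assumption.

```python
import cv2
import numpy as np

def corner_boxes(corners, frame_shape, patch=(20, 14), min_density=0.35):
    """Group persistent corners into candidate graphic-object bounding boxes and
    discard boxes whose corner-patch pixel density is too low."""
    height, width = frame_shape[:2]
    mask = np.zeros((height, width), dtype=np.uint8)
    half_w, half_h = patch[0] // 2, patch[1] // 2
    for (x, y) in corners:
        x, y = int(x), int(y)
        mask[max(0, y - half_h):y + half_h, max(0, x - half_w):x + half_w] = 255

    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    boxes = []
    for i in range(1, n_labels):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area / float(w * h) >= min_density:  # reject sparse, scattered corner zones
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```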

The test set we used was composed of 19 sports videos acquired from PAL digital video tapes at full PAL resolution and frame rate (720 × 576 pixels, 25 fps), resulting in 47,014 frames (31 minutes 20 seconds).

Figure 6 provides an example of graphic-object detection. To test the robustness with respect to text size and video resolution, we also digitized two S-VHS videos. One video was acquired at full PAL resolution (720 × 576 pixels, 25 fps), resulting in 4,853 frames (3 minutes 14 seconds), and the second one was acquired at half PAL resolution (368 × 288, 25 fps) and contained 425 frames (totaling 17 seconds). With this latter resolution, some text captions in the videos become only 9 pixels high.

When we evaluate our results, we consider graphic-object detection (whether the algorithm correctly detected a graphic object's appearance) and correct detection of the graphic object's bounding box (see Table 1). Graphic-object detection has a precision of 80.6 percent and a recall of 92 percent. These figures are due to missed detections in the VHS video and to only one missed detection in the digital videos. The graphic-object bounding box miss rate is 5 percent. Results also included six false detections, which were due to the presence of scene text.

Figure 5. Determining graphic objects' bounding boxes. (a) Source frame. (b) Detected captions with noise removal.

Figure 6. The results from detecting the bounding boxes in Figures 3a and 3b.

Table 1. Text event detection and text box localization.

                      Occurrences   Misdetections   False detections
Graphic-object event  63            5               9
Graphic-object boxes  100           5               37

Classifying visual shot features

We devised a feature vector comprising edge, segment, and color features. Figure 7 shows some of the selected features for representatives of the three classes playing field, player, and audience, respectively.

Figure 7. Edge image, edge intensity histogram (256 bins), segments, segment length histogram (l∈[0, √(w²+h²)), 30 bins), segment angle histogram (α∈[0, π), 18 bins), and hue histogram (64 hues) for the three representative sample images from Figure 4. Synthetic indices derived from these distributions let us differentiate between the three classes playing field, player, and audience. (Note that we scaled hue histograms to the maximum value.)

First, we performed edge detection and applied a recursive growing algorithm to the edges to identify the image's segments.11 We analyzed the distribution of edge intensities to evaluate the degree of uniformity. We also analyzed the distributions of lengths and orientations of the segments to extract the maximum length of segments in an image, as well as to detect whether peaks exist in the distribution of orientations. Our choice was driven by these observations:

❚ playing field lines are characteristic segments in playing field scenes,12 because these lines determine peaks in the orientation histogram and longer segments than in other types of scenes;

❚ audience scenes are typically characterized by more or less uniform distributions of edge intensities, segment orientations, and hue; and

❚ player scenes typically feature fewer edges, a uniform segment orientation distribution, and short segments.

We also expect color features to increase the robustness of the first classification stage (for example, audience scenes display more uniform color distributions than playing field or player scenes) and to support sports type identification. In fact, the playing field each sport usually takes place on typically features a few dominant colors—one or two, in most cases. This is particularly the case in long and mid-range camera shots, where the frame area occupied by players is only a fraction of the whole area. Furthermore, for each sports type, the color of the playing field is either fixed or varies within a small set of possibilities. For example, a soccer field is always green and a swimming pool is always blue. We describe color content through color histograms. We selected the HSI color space and quantized it into 64 levels for hue, three levels for saturation, and three levels for intensity. We also derived indices describing the distribution—that is, degree of uniformity and number of peaks—from these distributions.
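The sketch below illustrates the kind of feature vector these observations suggest. Canny edges and probabilistic Hough segments stand in for the article's edge detector and stick-growing segment extraction (ref. 11), and HSV stands in for HSI; the bin counts follow the figures quoted in the text, while the detector parameters are assumptions.

```python
import cv2
import numpy as np

def scene_feature_vector(frame_bgr):
    """Edge, segment, and color features for the playing field/player/audience classifier."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Edge-intensity histogram (256 bins) from the gradient magnitude.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    magnitude = cv2.magnitude(gx, gy)
    edge_hist, _ = np.histogram(magnitude, bins=256, range=(0, magnitude.max() + 1e-6))

    # Segment length (30 bins) and orientation (18 bins over [0, pi)) histograms.
    segments = cv2.HoughLinesP(cv2.Canny(gray, 100, 200), 1, np.pi / 180, 50,
                               minLineLength=20, maxLineGap=5)
    lengths, angles = [], []
    if segments is not None:
        for x1, y1, x2, y2 in segments.reshape(-1, 4):
            lengths.append(np.hypot(x2 - x1, y2 - y1))
            angles.append(np.arctan2(y2 - y1, x2 - x1) % np.pi)
    diagonal = np.hypot(frame_bgr.shape[0], frame_bgr.shape[1])
    length_hist, _ = np.histogram(lengths, bins=30, range=(0, diagonal))
    angle_hist, _ = np.histogram(angles, bins=18, range=(0, np.pi))

    # Color histogram: 64 hue x 3 saturation x 3 intensity levels.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    color_hist = cv2.calcHist([hsv], [0, 1, 2], None, [64, 3, 3],
                              [0, 180, 0, 256, 0, 256]).ravel()

    # One of the synthetic indices mentioned in the text: maximum segment length.
    max_length = max(lengths) if lengths else 0.0
    return np.concatenate([edge_hist, length_hist, angle_hist, color_hist, [max_length]])
```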

We used two neural network classifiers to perform the classification tasks. To evaluate their performance, we extracted more than 600 frames from a range of video shots and manually annotated them to define a ground truth. We then subdivided the frames into three sets to train, test, and evaluate the classifiers. We computed the edge, segment, and color features for all the frames. Table 2 summarizes results for the scene-type classification. Extending this classification scheme to shots, rather than just limiting it to key frames, will yield an even better performance, because integrating results for key frames belonging to the same shot reduces error rates. For instance, some key frames of a playing field shot might not contain playing field lines (for example, because of zoom), but others will. Hence, we can classify the whole shot as a playing field shot.

Table 2. Results for the classification of key frames in terms of playing field, player, and audience classes.

Class          Correct (%)   Missed (%)   False (%)
Playing field  80.4          19.6         9.8
Player         84.8          15.2         15.1
Audience       92.5          7.5          9.8
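The shot-level extension could be as simple as a majority vote over the key-frame labels produced by the classifiers; the article does not specify the integration rule, so the vote below is an assumed sketch.

```python
from collections import Counter

def classify_shot(key_frame_labels):
    """Assign a shot the class most frequently predicted for its key frames."""
    if not key_frame_labels:
        return "unknown"
    (label, _), = Counter(key_frame_labels).most_common(1)
    return label

# Example: two key frames lack field lines (say, zoomed-in views), yet the
# shot as a whole is still classified as a playing-field shot.
print(classify_shot(["playing field", "player", "playing field",
                     "playing field", "player"]))  # -> "playing field"
```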

Table 3 shows the results of the sports-type identification. The first column refers to an experiment we carried out on a data set including playing field, player, and audience scenes. The second column summarizes an experiment we carried out after we used a filtering process that kept only playing field frames. As expected, in the former case, we obtained lower success rates. By comparing the results in the two columns, we can observe that introducing the playing field, player, and audience classes is instrumental to improving identification rates for sports types. On average, the rates improve by 16 percent, with a maximum of 26 percent. The highest improvement rates are for those sports where the playing field is shown only for small time intervals (such as high diving) or where only one athlete competes and the video frequently shows close-ups of the athlete (such as the javelin).

Table 3. Sports-type identification results. The evaluation set in the first experiment included playing field, player, and audience scenes. The second set included only playing field scenes.

Sports Type                  All Frames (%)   Playing Field Only (%)
High diving                  56.9             83.2
Gymnastics floor exercises   78.7             97.4
Field hockey                 85.0             95.1
Long horse                   53.4             64.3
Javelin                      37.8             58.8
Judo                         80.6             96.9
Soccer                       80.3             93.2
Swimming                     77.4             96.1
Tennis                       69.1             94.5
Track                        88.2             92.7

Future work

Besides refining the techniques we describe in this article, our future work includes introducing new semantic elements for a given semantic layer, such as motion and framing terms (for example, close-up versus long shots); increasing the overall level of semantic description (for example, by adding descriptors for events and relevant highlights); and transitioning from elements of visual content to relationships among elements (spatio-temporal relations). We're also implementing shape-analysis techniques to support highlight detection (for example, by identifying playing field lines, goals, hurdles, and so forth).

Eventually, our work should yield an exhaustive annotation of sports videos, letting us select highlights in a sports event to enhance production logging or extract metadata to achieve posterity logging. For example, we expect our system to detect (missed) goals, penalties, or corner shots in a soccer game by combining (camera) motion patterns, information on the location of players with regard to playing field lines, and other relevant markers. When accessing historical archives, users could benefit from richly annotated files, helping them retrieve specific shots (or types of shots) on demand.

Acknowledgments

This work was partially supported by the ASSAVID EU Project (Automatic Segmentation and Semantic Annotation of Sports Videos, http://www.bpe-rnd.co.uk/assavid/) under contract IST-13082. The consortium comprises ACS SpA, Italy; BBC R&D, UK; Institut Dalle Molle D'Intelligence Artificielle Perceptive (Dalle Molle Institute for Perceptual Artificial Intelligence), Switzerland; Sony BPE, UK; the University of Florence, Italy; and the University of Surrey, UK.

References

1. R.S. Heller and C.D. Martin, "A Media Taxonomy," IEEE MultiMedia, vol. 2, no. 4, Winter 1995, pp. 36-45.

2. C. Colombo, A. Del Bimbo, and P. Pala, "Semantics in Visual Information Retrieval," IEEE MultiMedia, vol. 6, no. 3, July–Sept. 1999, pp. 38-53.

3. S. Eickeler and S. Muller, "Content-Based Video Indexing of TV Broadcast News Using Hidden Markov Models," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP), IEEE Press, Piscataway, N.J., 1999, pp. 2997-3000.

4. H. Miyamori and S.-I. Iisaku, "Video Annotation for Content-Based Retrieval Using Human Behavior Analysis and Domain Knowledge," Proc. Int'l Conf. Automatic Face and Gesture Recognition 2000, IEEE CS Press, Los Alamitos, Calif., 2000.

5. Y. Ariki and Y. Sugiyama, "Classification of TV Sports News by DCT Features Using Multiple Subspace Method," Proc. 14th Int'l Conf. Pattern Recognition (ICPR 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 1488-1491.

6. W. Zhou, A. Vellaikal, and C.C.J. Kuo, "Rule-Based Video Classification System for Basketball Video Indexing," Proc. ACM Multimedia 2000 Workshop, ACM Press, New York, 2000, pp. 213-216.

7. N. Dimitrova et al., "Entry into the Content Forest: The Role of Multimedia Portals," IEEE MultiMedia, vol. 7, no. 3, July–Sept. 2000, pp. 14-20.

8. T. Sato et al., "Video OCR for Digital News Archive," Proc. IEEE Int'l Workshop Content-Based Access of Image and Video Databases (CAIVD 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 52-60.

9. Y. Zhong, H. Zhang, and A.K. Jain, "Automatic Caption Localization in Compressed Video," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 4, Apr. 2000, pp. 385-392.

10. H. Li and D. Doermann, "Automatic Identification of Text in Digital Video Key Frames," Proc. Int'l Conf. Pattern Recognition (ICPR 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 129-132.

11. R.C. Nelson, "Finding Line Segments by Stick Growing," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 5, May 1994, pp. 519-523.

12. Y. Gong et al., "Automatic Parsing of TV Soccer Programs," Proc. Int'l Conf. Multimedia Computing and Systems (ICMCS 95), IEEE CS Press, Los Alamitos, Calif., 1995, pp. 167-172.

Jürgen Assfalg is a research associate at the Department of Information Systems at the University of Florence, Italy. His research activity addresses multimedia databases, advanced user interfaces, usability engineering, virtual reality, and 3D graphics. He graduated with a degree in electronics engineering and has a PhD in information and telecommunications technology from the University of Florence, Italy. He is an IEEE and ACM member. Readers may contact Jürgen Assfalg at the Dipartimento di Sistemi e Informatica, Via S. Marta 3, 50139 Firenze, Italy, email [email protected].

Marco Bertini is a PhD candidate in information and telecommunications technology at the University of Florence, Italy. His research interests include content-based video annotation and retrieval and Web user interfaces. He has a Laurea degree in electronics engineering from the University of Florence, Italy.

Carlo Colombo is a senior assistant professor at the University of Florence, Italy. His current research activities focus on theoretical and applied computer vision for semi-autonomous robotics, advanced human–machine interaction, and multimedia systems technology. He has an MS in electronic engineering from the University of Florence, Italy, and a PhD in robotics from the Sant'Anna School of University Studies and Doctoral Research, Pisa, Italy. He is an editorial board member of the Journal of Robotics and Autonomous Systems.

Alberto Del Bimbo is a full professor of computer engineering and the Director of the Master in Multimedia of the University of Florence, Italy. He is also the Deputy Rector of the University of Florence, in charge of research and innovation transfer. His research interests include pattern recognition, image databases, multimedia, and human–computer interaction. He graduated with honors in electronic engineering from the University of Florence, Italy. He is the author of the Visual Information Retrieval monograph on content-based retrieval from image and video databases. He is also associate editor of Pattern Recognition, the Journal of Visual Languages and Computing, the Multimedia Tools and Applications Journal, Pattern Analysis and Applications, the IEEE Transactions on Multimedia, and the IEEE Transactions on Pattern Analysis and Machine Intelligence. He is a member of the IEEE and the International Association for Pattern Recognition (IAPR).

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.