

Statistical Models of Video Structure for Content Analysis and Characterization

Nuno Vasconcelos, Student Member, IEEE, and Andrew Lippman, Member, IEEE

Abstract—Content structure plays an important role in the understanding of video. In this paper, we argue that knowledge about structure can be used both as a means to improve the performance of content analysis and to extract features that convey semantic information about the content. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models with two practical applications. First, we develop a Bayesian formulation for the shot segmentation problem that is shown to extend the standard thresholding model in an adaptive and intuitive way, leading to improved segmentation accuracy. Second, by applying the transformation into the shot duration/activity feature space to a database of movie clips, we also illustrate how the Bayesian model captures semantic properties of the content. We suggest ways in which these properties can be used as a basis for intuitive content-based access to movie libraries.

Index Terms—Bayes procedures, shot duration and activity, video databases, video modeling, video representations, video segmentation, video semantics, Weibull prior.

I. INTRODUCTION

CURRENT video characterization and retrieval systems rely on image representations based on low-level visual primitives such as color, texture, and motion. While practical and computationally efficient, such characterization places on the interface to the retrieval system the burden of bridging the semantic gap between the low-level nature of the primitives and the high-level semantics that people rely on to perform the task. Because establishing this bridge is a difficult problem, there is interest in alternative representations built upon content descriptors at a higher semantic level.

The most obvious of these alternatives is probably that of object-based representations. The ability to decompose a scene into the objects that compose it would allow semantic descriptions of arbitrary detail and enable intuitive video manipulation [7]. This type of scene decomposition can, however, only be achieved through sophisticated image segmentation. Despite significant recent progress in unsupervised segmentation [39], [49], [52], [53], [59], it does not seem that such algorithms, applicable across the different types of imagery that make up video databases and capable of producing semantic segmentations, will be achievable in the near future. On the other hand, supervised segmentation [9], [10] is not viable for video libraries,

Manuscript received November 27, 1998; revised July 9, 1999. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hong Jiang Zhang.

The authors are with the Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]; [email protected]).


where very large volumes of content must be processed over relatively short periods of time.

This does not imply that region-based representations are not useful for content characterization and retrieval. In fact, several researchers have shown that they provide valuable extensions to the “query by example” paradigm [8], [11], [13], [41], [58]. The idea is to be able to search images by similar regions of coherent color, texture, shape, and motion instead of simpler feature representations such as the image histograms [43] or color layouts used by early retrieval systems [15], [34]. However, while regions may be better than simple features, the fact that they are not necessarily meaningful to people still poses some major difficulties. For example, it is not uncommon for an object to be segmented into several regions of different color or motion, one of these regions leading to an unintuitive match with a semantically unrelated object. Furthermore, not all regions have the same perceptual significance, and it is not easy to decide how to weight each of them for the purpose of characterizing video.

Similarly to the earlier feature-based methods, one can always count on the user to reduce the gap between the machine representation and his/her own. In the context of region-based representations, this can be achieved by asking the user to interact with the machine in terms of regions. Some of the more sophisticated retrieval systems do just this: the user is allowed to formulate a query by specifying a few regions and their motion [11], or the internal region-based representation of the system is displayed to help the user figure out what went wrong during an unsuccessful query [8]. This type of interaction by reverse engineering of the machine representation and reasoning can, however, be unintuitive, at least for naive users.

The fact is that people characterize content according to high-level concepts, such as its amount of action, romance, or comedy, that, most of the time, are not even related in a straightforward way to the visual attributes of the pixels that compose each image. Therefore, there is a need for new representations capable of capturing such high-level properties of the video. We believe that the most promising path for achieving this goal is to move away from the objective of understanding all the pixels in each image and concentrate instead on the higher level structure exhibited by the video. In fact, we argue that for most content classes that one would care to store in a video database, this structure is plentiful and plays a significant role in perceptual decoding of the video. The long-term goal of the research is to identify semantically relevant features and build computational models for their extraction and characterization. In this paper, we give the first steps toward this goal by developing statistical models for two important elements of the video structure: shot duration and activity.



While building good models for video is an interesting exercise, the ultimate measure of success of a given model is the efficiency of the solutions for practical and objective tasks that may be derived from it. In this context, once the models are built, we concentrate on two such tasks: temporal video segmentation and semantic characterization. Because knowledge of the video structure is a form of prior knowledge, Bayesian procedures [19] become a natural solution for these tasks. We therefore introduce a Bayesian framework for segmentation and characterization that is shown to significantly outperform currently used approaches and to capture relevant semantic properties of the content.

The paper is organized as follows. In Section II, we argue that high-level structure is prevalent in most video domains and has a direct impact on our ability to understand the video. We identify two important semantic features, shot duration and activity, and introduce statistical models for these features in Sections III and IV. The remainder of the paper is devoted to practical applications of these models. In Section V, we introduce a Bayesian solution to the problem of temporal video segmentation. A detailed experimental analysis of its performance is carried out in Section VI. Section VII then illustrates the semantic content characterization ability of the shot-duration/scene-activity feature space on a large database of movie clips. Finally, Section VIII presents some conclusions and pointers for future work. Preliminary reports on this work were previously presented in [47] and [50].

II. VIDEO STRUCTURE

The most obvious example of a domain where structure plays a significant role in the characterization of video is that of television newscasts. A newscast contains a significant amount of both spatial and temporal structure. Examples of spatial structure are the standardized spatial layouts to which the shots of the anchor-person and the various separators between different sections of the newscast usually obey. Examples of temporal structure are the regularly spaced appearances of the anchor-person, usually indicating the start of a new story, or the standardized temporal intervals in which the different news sections are covered (e.g., the sports section always appears “x” minutes after the start of the newscast). As a consequence of all this structure, it is relatively easy to give a machine the understanding of commands such as “skip ahead to the sports section” without the need to understand all the pixels in every image of the video sequence. This has made the topic of analyzing newscasts a very popular one in the video databases literature [3], [21], [27], [56].

The ability to infer semantic information is directly related to the amount of structure exhibited by the content. While newscasts are at the highly structured end of the spectrum and constitute one of the easiest classes to analyze, the raw output of a personal camcorder exhibits almost no structure and is typically too difficult to characterize semantically [26]. Between these two extrema, there are various types of content for which the characterization has a varying level of difficulty. Our interests are mostly in domains that are more generic than newscasts, but still follow enough content production codes to exhibit a significant amount of structure. A good example of such a domain is that of feature films.

A. Structure in Movies

While the analysis of all the elements that contribute to the structure of a movie would require significantly more space than is available here, it is worth pointing out that there are some well-known principles in film theory that can be exploited for the purpose of semantic characterization. In particular, it is well known that the stylistic elements of a movie are closely related to the message conveyed in its story [5], [30], [37]. Historically, these stylistic elements have been grouped into two major categories: the elements of montage and the elements of mise-en-scene. Montage refers to the temporal structure, namely the aspects of film editing or the way in which the different shots are composed to form the scenes in the movie. On the other hand, mise-en-scene deals with spatial structure, i.e., the composition of each image, and includes variables such as the type of set in which the scene develops, the placement of the actors on the scene, aspects of lighting, focus, camera angles, and so on.

From the content characterization perspective, the important point is that, while both elements of montage and mise-en-scene can be used to manipulate the emotions of the audience (this manipulation is, after all, the ultimate goal of the director), there are some very well established codes or rules to achieve this. For example, a director trying to put forth a text deeply rooted in the construction of character (e.g., a drama or a romance) will necessarily have to rely on a fair amount of facial close-ups, as close-ups are the most powerful tool for displaying emotion,¹ an essential requirement to establish a strong bond between the audience and the characters in the story. If, on the other hand, the same director is trying to put forth a text of the action or suspense genres, the elements of mise-en-scene become less relevant than the rhythmic patterns of montage. In action or suspense scenes, it is imperative to rely on fast cutting, and manipulation of the cutting rate is the tool of choice for keeping the audience “at the edge of their seats.” Directors who exhibit supreme mastery in the manipulation of the editing patterns are even referred to as montage directors.²

It is therefore to be expected that the analysis of the elements of montage and mise-en-scene can lead to feature spaces where the content is laid out in a way that would allow a machine to characterize it on a semantic basis. While there is a fundamental element of montage, the shot duration, it is harder to identify a single defining characteristic of mise-en-scene. It is, nevertheless, clear that among the elements of mise-en-scene, one important property for the semantic characterization of a movie is the amount of activity in its shots: while action-packed movies typically contain a strong component of active shots, character-based movies are mainly composed of scenes with shots (e.g., dialogues) of smaller activity. Furthermore, important events for the semantic classification of action movies, such as explosions, car chases, or fights, are typically associated with highly active shots. Finally, the amount of activity is usually correlated with the amount of violence in the content (at least that of

¹The importance of close-ups is best summarized in the quote from Charles Chaplin: “Tragedy is a close-up, comedy a long shot.”

²The best known example in this class is Alfred Hitchcock, who relied intensively on editing to create suspense in movies like Psycho or Birds [5].


Fig. 1. Shot duration histogram and maximum likelihood fit obtained with the Erlang distribution.

Fig. 2. Shot duration histogram and maximum likelihood fit obtained with the Weibull distribution.

a gratuitous nature) and can provide clues for its detection. For these reasons, in this work, we develop computational models for the shot duration and activity.

III. MODELING SHOT DURATION

Several probabilistic models can be used for shot duration. Because shot boundaries can be seen as arrivals over discrete, nonoverlapping temporal intervals, a Poisson process seems an appropriate prior for the task of boundary detection [14]. However, events generated by Poisson processes have interarrival times characterized by the exponential density, which is a monotonically decreasing function of time. This is clearly not the case for the shot duration, as can be seen from the histograms of Figs. 1 and 2. In this work, we consider two more generic models, the Erlang and Weibull distributions, which provide more realistic models for shot duration.

A. Erlang Model

The first model that we consider for the time elapsed since the previous shot boundary is the Erlang distribution [14]. Denoting by τ the time since the previous boundary, the Erlang distribution is described by

    ε_{r,λ}(τ) = λ^r τ^{r−1} e^{−λτ} / (r − 1)!                    (1)

It is a generalization of the exponential density, characterized by two parameters: the order r, and the expected interarrival time (1/λ) of the underlying Poisson process. When r = 1, the Erlang distribution becomes the exponential distribution. For larger values of r, it characterizes the time between the rth-order interarrival times of the Poisson process. This leads to an intuitive explanation for the use of the Erlang distribution as a model of shot duration: for a given order r, the shot is modeled as a sequence of r events that are themselves the outcomes of Poisson processes. Such events may reflect properties of the shot content, such as “setting the context” through a wide angle view followed by “zooming in on the details” when r = 2, or “emotional buildup” followed by “action” and “action outcome” when r = 3.

Our experiments show that the Erlang distribution provides a good model for shot duration. Fig. 1 presents a shot duration histogram, obtained from the training set to be described in Section VI, and its maximum likelihood (ML) Erlang fit obtained according to the procedure described in Appendix A. It can be seen that the Erlang fit is a good approximation to the empirical density.
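As an aside for readers who want to reproduce this kind of fit, the sketch below implements the standard ML recipe for an Erlang (integer-shape gamma) density rather than the exact Appendix A procedure: for a fixed order r the ML rate is λ = r divided by the sample mean, so one can scan r and keep the most likely pair. The function name and the synthetic durations are illustrative only.

```python
# Minimal sketch of an ML Erlang fit to shot durations (assumes numpy/scipy).
import numpy as np
from scipy import stats

def fit_erlang(durations, max_order=10):
    """Scan integer orders r; for each, the ML rate is lambda = r / mean."""
    tau = np.asarray(durations, dtype=float)
    best = None
    for r in range(1, max_order + 1):
        lam = r / tau.mean()                                  # ML rate for fixed r
        ll = stats.gamma.logpdf(tau, a=r, scale=1.0 / lam).sum()
        if best is None or ll > best[0]:
            best = (ll, r, lam)
    return best  # (log-likelihood, order r, rate lambda)

# toy usage on synthetic, Erlang-like durations with mean ~6 s
rng = np.random.default_rng(0)
ll, r, lam = fit_erlang(rng.gamma(shape=3, scale=2.0, size=500))
print(f"r = {r}, lambda = {lam:.3f}, log-likelihood = {ll:.1f}")
```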

The main limitation of the Erlang model is its dependence on the constant arrival rate assumption [22] inherent to the underlying Poisson process. Because λ is a constant, the expected rate of occurrence of a new shot boundary in the next frame interval is the same whether 10 s or 1 h have elapsed since the occurrence of the previous one. This type of behavior is clearly inappropriate for most physical processes, and several alternative models have been proposed in the statistics literature to handle the problem. We next consider one such alternative: the Weibull distribution [22].

B. Weibull Model

The Weibull distribution generalizes the exponential distribution by considering an expected rate of arrival of new events that is a function of time

    λ(τ) = α τ^{α−1} / β^α

and of the parameters α and β, leading to a probability density of the form

    W_{α,β}(τ) = (α τ^{α−1} / β^α) e^{−(τ/β)^α}                    (2)

For most values of α, the distribution does not have heavy tails, and robust estimation procedures are required to avoid high sensitivity to outliers. Fig. 2 presents the robust ML Weibull fit,


obtained according to the procedure described in Appendix B, to the shot duration histogram of Fig. 1. Once again we obtain a good approximation to the empirical density estimate provided by the histogram.
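A plain (nonrobust) version of such a fit is nearly a scipy one-liner. The sketch below is not the robust Appendix B estimator, only its standard ML counterpart, with the location parameter pinned at zero; the toy data and parameter values are illustrative assumptions.

```python
# Minimal sketch of a plain ML Weibull fit (assumes numpy/scipy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
durations = 6.0 * rng.weibull(1.6, size=500)        # toy shot durations (seconds)

alpha, loc, beta = stats.weibull_min.fit(durations, floc=0)   # shape, 0, scale
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")

# arrival rate implied by the fit: lambda(t) = alpha * t**(alpha-1) / beta**alpha
t = np.linspace(0.1, 30.0, 100)
rate = alpha * t ** (alpha - 1) / beta ** alpha     # increases with t when alpha > 1
```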

IV. MODELING SHOT ACTIVITY

Given a sequence of images, a considerable number of methods can be used to obtain an estimate of the amount of activity in the underlying scene. In this work, we consider two metrics that can be derived from low-level image measurements: the color histogram distance and the tangent distance between successive images in the sequence.

A. Color Histogram Distance

The color histogram distance has been widely used as a measure of (dis)similarity between images for the purposes of object recognition [17], [43], content-based retrieval [15], [24], [28], [33], [41], and temporal video segmentation [6], [18], [31], [42], [54], [57]. A histogram is first computed for each image in the sequence, and the distance between successive histograms is used as a measure of local activity. Among the metrics proposed in the literature, we use the L1 norm of the histogram difference

    D(h_t, h_{t−1}) = Σ_{i=1}^{B} |h_t(i) − h_{t−1}(i)|                    (3)

where h_t and h_{t−1} are histograms of successive frames, and B the number of histogram bins. This metric has been shown to perform well for temporal video segmentation [6] and is equivalent (for normalized histograms) to the histogram intersection metric [43] commonly used in the retrieval literature.
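In code, the feature of (3) amounts to a bin count followed by an absolute difference. The sketch below assumes uint8 RGB frames as numpy arrays and a joint 8x8x8 color histogram; the paper does not specify the bin layout, so that choice is illustrative.

```python
# Sketch of the activity feature of (3) (assumes numpy, uint8 HxWx3 frames).
import numpy as np

def color_histogram(frame, bins_per_channel=8):
    """Joint RGB histogram, normalized to sum to 1."""
    q = frame.reshape(-1, 3).astype(np.int64) // (256 // bins_per_channel)
    idx = (q[:, 0] * bins_per_channel + q[:, 1]) * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()

def l1_distance(h_a, h_b):
    # for normalized histograms this equals 2 * (1 - histogram intersection)
    return np.abs(h_a - h_b).sum()
```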

Since histograms are invariant to most types of motion either by the camera or the objects in the scene, they perform very well for tasks where invariance is a big plus, e.g., video segmentation and retrieval. It is not clear, however, that invariance is a major advantage for a metric of scene activity, since part of what has to be measured is precisely the amount of motion in the scene. In fact, the metric should only be invariant with respect to camera motions, e.g., pans and zooms, which do not necessarily correlate with activity.³ The tangent distance [40] between successive images has this property.

B. Tangent Distance

The key idea behind the tangent distance is that, when subject to spatial transformations, images span manifolds embedded in a much higher dimensional Euclidean space, and a metric invariant to those transformations should measure the distance between those manifolds instead of the distance between other properties of (or features extracted from) the images themselves. However, because the manifolds can be very complex, minimizing the distance between them is a hard optimization problem. The problem can be made tractable by considering instead the minimization of the distance between the tangent planes to the manifolds.

³For example, pans are prevalent on scenic videos that depict scenes of very low activity.

Given two images x and y, and a transformation T(x; p) parameterized by the vector p, the distance between the associated manifolds is

    d(x, y) = min_{p,q} ‖T(x; p) − T(y; q)‖²                    (4)

Assuming, for simplicity, that one of the images (y) is fixed, and replacing T(x; p) by the tangent plane at the point p₀, we obtain the (one-sided) tangent distance

    d(x, y) = min_p ‖T(x; p₀) + ∇_p T(x; p₀)ᵀ (p − p₀) − y‖²                    (5)

Many transformations can be used in this equation.

Because we are mostly interested in invariance against camera motion, we consider the set of affine transformations T(x; p), with

    T(x; p)(u, v) = x(p₁u + p₂v + p₃, p₄u + p₅v + p₆),   p = (p₁, …, p₆)                    (6)

capable of compensating for translation (panning), scaling (zooming), in-plane rotation, and shearing. The cost function of (5) can be minimized using a multiresolution variant of Newton’s method [4], leading to the following algorithm [2], [48]. For a given level of the multiresolution decomposition, follow these steps.

1) Compute x_w = T(x; p̂) by warping image x according to the best current estimate p̂ of p, and compute its spatial gradient ∇x_w.
2) Update the estimate of p according to

    p̂ ← p̂ + (JᵀJ)⁻¹ Jᵀ (y − x_w),   J = ∇_p T(x; p̂)

3) Stop if convergence; otherwise go to 1).

Once the final p̂ is obtained, it is passed to the multiresolution level below, by simply doubling the translation parameters. The rescaled vector is then used as the initial estimate at that level, and the above process repeated. Once this iterative procedure has converged for all levels of the multiresolution decomposition, the tangent distance between the images is computed through (5), using the optimal parameter vector.
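As a rough illustration of the computation, the sketch below evaluates a single-level, one-sided tangent distance under the affine family by taking one least-squares (Gauss-Newton) step at the identity; the paper's full multiresolution iteration and image warping are omitted, so treat this as the flavor of the method, not a reimplementation of it.

```python
# Single-level sketch of the one-sided tangent distance of (5) (assumes numpy).
import numpy as np

def tangent_distance(x, y):
    """x, y: 2-D float images of equal shape; returns the squared distance."""
    gy, gx = np.gradient(x)                        # spatial gradient of x
    h, w = x.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)        # pixel coordinates
    # tangent-plane basis: derivatives of the affinely warped image with
    # respect to the six parameters, evaluated at the identity transformation
    J = np.stack([gx * u, gx * v, gx,
                  gy * u, gy * v, gy], axis=-1).reshape(-1, 6)
    r = (y - x).reshape(-1)                        # residual at the identity
    p, *_ = np.linalg.lstsq(J, r, rcond=None)      # best tangent-plane step
    return float(np.sum((r - J @ p) ** 2))         # squared distance, cf. (5)
```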

C. Statistical Models of Activity

Given a set of activity features, the second step of the modeling consists of finding good statistical representations for them. Here, it is important to realize that there may not be a universal model applicable all of the time, but different models may be best suited for the different states through which the video progresses. For example, the principle of continuity in editing states that, in order not to confuse the viewer, the first frames after a shot transition should be significantly different from the last frames before it [37]. Thus, during a typical shot transition, any activity metric is likely to take values that are significantly different from those observed within each shot. For simplicity, in this work we restrict ourselves to a


video model composed of two states: “regular frames” and “shot transitions.” The fundamental principles are, however, applicable to more complex models with more states.

1) Modeling State Densities: We start by defining two states: S = 0, associated with regular frames, and S = 1, associated with shot boundaries. These states are not directly measurable from the video, but can only be inferred from the activity features described above. We designate these features by D. Once the states are identified, the goal is to design good conditional density models for the observed activity features given each state. This is not as easy as in the case of the shot duration, since there do not seem to exist simple densities that can provide a good fit to the empirical observations. In order to overcome this limitation, we rely on generic mixture models.

A mixture density [45] is defined as

    P(D) = Σ_{c=1}^{C} π_c P_c(D | θ_c)

where C is the number of mixture components, π_c the probability of the cth class, P_c(D | θ_c) the likelihood of the observed data for this class, and θ_c the corresponding parameter vector. For example, in the case of a Gaussian mixture component

    P_c(D | θ_c) = N(D; μ_c, σ_c²)   and   θ_c = {μ_c, σ_c}

while for an Erlang component P_c(D | θ_c) = ε_{r,λ}(D) and θ_c = {r, λ} [as defined by (1)], and for a uniform component

    P_c(D | θ_c) = 1/(b − a) for D ∈ [a, b]   and   θ_c = {a, b}

where a and b are the extrema of the uniform density. The basic idea behind this model is that each observation of D is generated in two steps: first a class c is selected according to the π_c, and the observation is then drawn from the corresponding class likelihood.

Given an observed sample D = {D₁, …, D_n}, the ML estimates for the mixture parameters {π_c, θ_c}, c = 1, …, C, can be obtained with the expectation-maximization (EM) algorithm [12], [36]. EM is an iterative procedure that alternates between an expectation (E) and a maximization (M) step. The E-step computes the posterior probability h_{ic} that a sample point D_i was drawn from mixture component c

    h_{ic} = P(class = c | D_i) = π_c P_c(D_i | θ_c) / Σ_{k=1}^{C} π_k P_k(D_i | θ_k).

The M-step then finds the set of parameters that maximizes the weighted log-likelihood of the sample given the class assignments computed in the E-step

    {π_c, θ_c} = arg max Σ_{i=1}^{n} Σ_{c=1}^{C} h_{ic} log [π_c P_c(D_i | θ_c)]

under the constraint Σ_c π_c = 1. The algorithm is guaranteed to converge to at least a local maximum of the likelihood of the sample [12]. Furthermore, it can be shown [35], [36] using Lagrange multipliers that, independently of the conditional densities P_c(D | θ_c) considered in the model, the optimal values for the π_c updates are

    π_c = (1/n) Σ_{i=1}^{n} h_{ic}.

The M-step updates for the remaining parameters are similar to standard ML estimates, with the difference that we now have to take into account the weights h_{ic}. Usually this leads to a slight modification of the expressions obtained by ML. For example, in the Gaussian case [35], [36]

    μ_c = Σ_i h_{ic} D_i / Σ_i h_{ic},   σ_c² = Σ_i h_{ic} (D_i − μ_c)² / Σ_i h_{ic}                    (7)

while in the Erlang case, r and λ can be found by a procedure similar to the one described in Appendix A, using a slightly modified expression for the optimal value of λ

    λ = r Σ_i h_{ic} / Σ_i h_{ic} D_i.
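The recursion above is only a few lines of code. The sketch below implements it for a one-dimensional mixture of Gaussians, i.e., the shot-transition model; the regular-frame model of the paper mixes Erlang and uniform components instead, but only the M-step formulas change.

```python
# Sketch of EM for a 1-D Gaussian mixture, with the updates of (7) (numpy only).
import numpy as np

def em_gaussian_mixture(x, k=2, iters=100, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(k, 1.0 / k)                       # component probabilities
    mu = rng.choice(x, size=k)                     # crude initialization
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: posterior h[i, c] that sample i came from component c
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        h = pi * lik
        h /= h.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates
        nc = h.sum(axis=0)
        pi = nc / n
        mu = (h * x[:, None]).sum(axis=0) / nc
        var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / nc
    return pi, mu, var
```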

2) Regular Frames: Fig. 3 presents the histogram of the activity features for the “regular frames” state for both the histogram and tangent distance metrics. It is clear that the distributions are very similar: asymmetric about their mean, always positive, and concentrated near zero. This suggests that a mixture of Erlang distributions is an appropriate model for this state, a suggestion that is confirmed by the fits obtained with EM, which are also depicted in the figure. In both cases, we have used a mixture of three Erlang components and a uniform component. The uniform component accounts for the tails of the distribution and, due to its small amplitude, is not perceptible in the figures.

3) Shot Transitions: Fig. 4 presents the histogram of the activity features for the shot transition state together with the fit obtained with a mixture of a Gaussian and a uniform density. Once again, the uniform density accounts for the tails of the distribution. When the shot transition state is considered, a simple Gaussian model appears to provide a reasonable approximation to the empirical density. This model has indeed been previously used as a basis for setting shot detection thresholds [57].

The fact that the density estimates depicted in Figs. 1–4 approximate well the empirical observations indicates that the models now introduced are good representations for shot duration and activity. However, because the ultimate measure of effectiveness of any given model is the degree to which it leads to effective solutions for objective tasks, in the remainder of the paper we concentrate on two such tasks: shot segmentation and semantic content characterization.

V. A BAYESIAN FRAMEWORK FOR SHOT SEGMENTATION

Because shot segmentation is a prerequisite for virtually any task involving the understanding, parsing, indexing,


Fig. 3. Conditional activity histogram given that there are no shot changes, and best fit by a mixture with three Erlang components and a uniform component. Top: histogram distance. Bottom: tangent distance.

characterization, or categorization of video, the grouping of video frames into shots has been an active topic of research in the area of content-based retrieval [6], [18], [20], [31], [42], [54], [57]. Extensive evaluation of various approaches has shown that simple thresholding of histogram distances performs surprisingly well and is difficult to beat [6], [18]. In this work, we consider an alternative formulation that regards the problem as one of statistical inference between two hypotheses:

• H₀: no shot boundary occurs between the two frames under analysis (S = 0);
• H₁: a shot boundary occurs between the two frames (S = 1).

In this setting, the optimal decision is provided by a likelihood ratio test where H₁ is chosen if

    P(D | S = 1) / P(D | S = 0) > 1                    (8)

and H₀ is chosen otherwise. It can be shown that standard thresholding is a particular case of this statistical formulation, in which both conditional densities are assumed to be Gaussians

Fig. 4. Conditional activity histogram for shot transitions, and best fit by a mixture with a Gaussian and a uniform component. Top: histogram distance. Bottom: tangent distance.

with the same covariance. From the discussion in the previous section, we know that this does not hold for real video. One further limitation of the thresholding model is that it does not take into account the fact that the likelihood of a new shot transition depends on how much time has elapsed since the previous one. On the other hand, the statistical formulation can easily incorporate the shot duration models developed in Section III. For this, we start by making some conventions with regard to notation.

A. Notation

Because video is a discrete process, characterized by a given frame rate, shot boundaries are not instantaneous, but last for one frame period. To account for this, states are defined over time intervals, i.e., instead of S_t = 0 or S_t = 1, we have S_{t,δ} = 0 or S_{t,δ} = 1, where t is the start of a time interval, and δ its duration. We designate the features observed during the interval by D_{t,δ}.

To simplify the notation, we take the temporal origin t = 0 to be the instant at which the last shot boundary has occurred and make all


temporal indexes relative to this instant. For example, S_{τ,δ} refers to the interval that starts τ units of time after the last shot boundary, and we write simply S_τ if δ = Δ. Furthermore, we reserve the symbol Δ for the duration of the interval between successive frames (inverse of the frame rate), and use the same notation for a simple frame interval and a vector of frame intervals (the temporal indexes being themselves enough to avoid ambiguity). For example, while S_{τ,Δ} = 0 indicates that no shot boundary is present in the interval [τ, τ + Δ], S_{0,τ} = 0 indicates that no shot boundary has occurred in any of the frames between 0 and τ. Similarly, D_{0,τ} represents the vector of observations in [0, τ].

B. Bayesian Formulation

Given that there is a shot boundary at time 0 and no boundaries occur in the interval [0, τ), the posterior probability that the next shot change happens during the interval [τ, τ + δ] is, using Bayes rule,

    P(S_{τ,δ} = 1 | D_{τ,δ}, S_{0,τ} = 0) = (1/Z) P(D_{τ,δ} | S_{τ,δ} = 1) P(S_{τ,δ} = 1 | S_{0,τ} = 0)

where Z is a normalizing constant (also known as partition function). Similarly, the probability that there is no change in [τ, τ + δ] is

    P(S_{τ,δ} = 0 | D_{τ,δ}, S_{0,τ} = 0) = (1/Z) P(D_{τ,δ} | S_{τ,δ} = 0) P(S_{τ,δ} = 0 | S_{0,τ} = 0)

and the posterior odds ratio [19] between the two hypotheses is

    P(S_{τ,δ} = 1 | D_{τ,δ}, S_{0,τ} = 0) / P(S_{τ,δ} = 0 | D_{τ,δ}, S_{0,τ} = 0)
      = [P(D_{τ,δ} | S_{τ,δ} = 1) / P(D_{τ,δ} | S_{τ,δ} = 0)] · [P(S_{τ,δ} = 1 | S_{0,τ} = 0) / P(S_{τ,δ} = 0 | S_{0,τ} = 0)]                    (9)

where we have assumed that, given S_{τ,δ}, D_{τ,δ} is independent of all other D and S. In this expression, while the first term on the right-hand side is the ratio of the conditional likelihoods of activity given the state sequence, the second term is simply the ratio of probabilities that there may (or not) be a shot transition τ units of time after the previous one. In the Bayesian terminology, the shot duration density becomes a prior for the segmentation process. This is intuitive, since knowledge about the shot duration is a form of prior knowledge about the structure of the video that should be used to favor segmentations that are a priori more plausible.

Assuming further that the state process is stationary, considering the probability density function p(τ) for the time elapsed until the first scene change after a shot boundary, and taking logarithms leads to a log posterior odds ratio of the form

    log R(τ, δ) = log [P(D_{τ,δ} | S_{τ,δ} = 1) / P(D_{τ,δ} | S_{τ,δ} = 0)] + log [ ∫_τ^{τ+δ} p(t) dt / (1 − ∫_0^{τ+δ} p(t) dt) ].                    (10)

The optimal answer to the question of whether a shot change occurs or not in [τ, τ + δ] is thus to declare that a boundary exists if

    log [P(D_{τ,δ} | S_{τ,δ} = 1) / P(D_{τ,δ} | S_{τ,δ} = 0)] > T(τ, δ) = log [ (1 − ∫_0^{τ+δ} p(t) dt) / ∫_τ^{τ+δ} p(t) dt ]                    (11)

and that there is no boundary otherwise. By comparing this equation with (8), it is clear that the inclusion of the prior model for shot duration transforms the fixed thresholding approach of that equation into an adaptive one, where the threshold depends on how much time has elapsed since the previous shot boundary. We next study the behavior of this threshold for each of the shot duration models introduced in Section III.
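Numerically, the adaptive threshold of (11) only requires the cdf of the duration prior. A minimal sketch, using an illustrative fitted Weibull prior and an assumed 30 frames/s rate:

```python
# Sketch of the adaptive threshold T(tau, delta) of (11) (assumes numpy/scipy).
import numpy as np
from scipy import stats

delta = 1.0 / 30.0                                 # one frame period, 30 fps assumed
prior = stats.weibull_min(c=1.6, scale=6.0)        # illustrative duration prior p(t)

def threshold(tau):
    p_in = prior.cdf(tau + delta) - prior.cdf(tau)     # boundary inside [tau, tau+delta]
    p_not_yet = prior.sf(tau + delta)                  # no boundary through tau+delta
    return np.log(p_not_yet / p_in)

# decreases monotonically with the time elapsed since the last boundary
print([round(threshold(t), 2) for t in (1.0, 5.0, 20.0, 60.0)])
```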

1) Erlang Model: Using the results of Appendix A, it can be seen that the threshold of (11) is particularly simple to compute under the Erlang assumption, where

    ∫_t^∞ ε_{r,λ}(x) dx = Σ_{n=0}^{r−1} e^{−λt} (λt)^n / n! ≡ E_{r,λ}(t).                    (12)

The log posterior odds ratio test for detecting a shot change—(11)—then becomes

    log [P(D_{τ,δ} | S_{τ,δ} = 1) / P(D_{τ,δ} | S_{τ,δ} = 0)] > T_erl(τ, δ)                    (13)

where

    T_erl(τ, δ) = log [ E_{r,λ}(τ + δ) / (E_{r,λ}(τ) − E_{r,λ}(τ + δ)) ].                    (14)

The typical temporal behavior of this threshold is presented in Fig. 5. While in the initial segment of the shot the threshold is very high and shot changes are very unlikely to be accepted, the threshold decreases as the scene progresses, increasing the likelihood that shot boundaries will be declared.

Even though the qualitative behavior of the threshold is what one would desire, a closer observation of the figure reveals the major limitation of the Erlang prior: its steady-state behavior. Ideally, in addition to decreasing monotonically over time, the threshold should not be lower bounded by a positive value, as this may lead to situations in which its steady-state value is high enough to miss several consecutive shot boundaries. Instead, the threshold should at some point become negative, guaranteeing that, in steady-state, any shot boundary detectable without a


Fig. 5. Log posterior odds ratio threshold as a function of the time elapsed since the beginning of the current shot, for the Erlang distribution.

prior is still detectable when the prior is introduced. Unfortunately, such a steady-state behavior is not achievable with an Erlang prior, for which

    lim_{τ→∞} T_erl(τ, δ) = log [ e^{−λδ} / (1 − e^{−λδ}) ],

a positive constant whenever λδ < ln 2 (always the case in practice, since δ is a single frame period while 1/λ is on the order of the average shot duration). This limitation is a consequence of the constant arrival rate assumption discussed in Section III and can be avoided by relying instead on the Weibull model.

2) Weibull Model: Similarly to the Erlang prior, the threshold of (11) is easy to compute under the Weibull model. As shown in Appendix B,

    ∫_t^∞ W_{α,β}(x) dx = e^{−(t/β)^α}                    (15)

from which

    T_wei(τ, δ) = log [ e^{−((τ+δ)/β)^α} / (e^{−(τ/β)^α} − e^{−((τ+δ)/β)^α}) ]                    (16)

and the threshold has the temporal behavior illustrated by Fig. 6. Unlike the threshold associated with the Erlang prior, T_wei(τ, δ) tends to −∞ when τ grows without bound. This guarantees that a new shot boundary will always be found if one waits long enough.

In summary, we see that both the Erlang and the Weibull priors lead to adaptive thresholds that are more intuitive than the fixed threshold commonly employed for shot segmentation. This suggests that the Bayesian approach may have higher accuracy. In the next section, we confirm this intuition with detailed experimental results.
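Schematically, the resulting segmenter is a one-pass loop that compares the activity log-likelihood ratio against the duration-dependent threshold and resets the clock at each detected boundary. In the sketch below, log_lr and threshold stand in for the fitted mixture models and the prior term of (11), e.g., as in the earlier sketches; both names are placeholders.

```python
# Schematic Bayesian segmentation loop (pure Python; log_lr and threshold
# are assumed to come from the fitted activity mixtures and duration prior).
def segment(activity_features, log_lr, threshold, delta):
    boundaries, tau = [], 0.0
    for frame, d in enumerate(activity_features):
        tau += delta                        # time elapsed since the last boundary
        if log_lr(d) > threshold(tau):      # adaptive test of (11)
            boundaries.append(frame)
            tau = 0.0                       # boundary declared: reset the clock
    return boundaries
```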

Fig. 6. Log posterior odds ratio threshold for the Weibull distribution.

VI. SEGMENTATION RESULTS

The performance of the Bayesian shot detection module was evaluated on a database containing 23 promotional movie trailers for commercially released feature films. Each trailer consists of 2–5 min of video, and the total number of shots in the database is 1959. The movie titles are presented in Table I. In all experiments, performance was evaluated by the leave-one-out method, i.e., one trailer was held out for evaluation of the segmentation accuracy and all the remaining ones were included in a training set used to learn model parameters. Ground truth was obtained by manual segmentation of all the trailers.

We start by evaluating the relative performance of Bayesian models with different shot duration priors and compare it against the best possible performance achievable with a fixed threshold. For the latter approach, the value of the optimal threshold was obtained by brute force, i.e., by testing several values and selecting the one that performed best. For completeness, besides the Erlang and Weibull priors, we also tested the performance of the Poisson prior.

The first experiment was designed to evaluate the dependence of segmentation accuracy on the free variables of the model, namely the number of components in each of the mixture densities. Fig. 7 presents graphs of the number of segmentation errors on the entire database for the three priors and two feature sets (histogram and tangent distance) considered. The curves in each graph depict the total error as a function of the number of Gaussians in the scene change activity model. Each curve corresponds to a different number of Erlang components in the activity model associated with regular frames. Also shown, as a constant line, in each graph is the best performance achieved with a fixed threshold.

Several observations can be made from the figure. First, while the Poisson prior leads to worse accuracy than that achievable with the static threshold, it is clear that both the Erlang and the Weibull priors lead to significant improvement over the performance achieved in the non-Bayesian setting. Second, the histogram distance performs better than the tangent distance. This was expected, given the discussion in Section IV. The main point


TABLE I
TITLES OF THE ENTRIES IN THE MOVIE DATABASE AND NAMES THAT APPEAR ON THE LEGEND OF FIG. 12

is, however, that the Bayesian framework is generic and improves the segmentation accuracy for both distance metrics.

Third, performance is insensitive to the number of Gaussians in the scene-change activity model. In fact, most of the curves are approximately constant. There is, however, some dependence on the number of Erlang components in the activity model for regular frames. In particular, a single component is insufficient for both feature sets, two components lead to the best results for the histogram distance, and three are required by the tangent distance. It is, however, relevant to notice that for a given set of distance features the model that performs best for all priors tends to be the same. This indicates that the Bayesian model is robust against possible mismatches in prior selection or poor parameter estimates.

Finally, the Weibull prior achieves the overall best performance. It not only leads to the smallest number of errors for both distance metrics, but it is also more robust than the other priors. Notice how, once the minimum number of Erlang components required to obtain a good fit to the underlying density is reached, the curves are approximately flat and exhibit similar values. The better performance of the Weibull prior is also visible in Fig. 8, which depicts the total number of errors, false positives, and missed boundaries for the best model associated with each prior. Notice that with this prior the Bayesian approach decreases the error rate of the standard static threshold by 15% to 20%, and that the 20% gain is obtained with the best distance metric (histogram).

With respect to false positives and misses, the use of the Weibull prior leads to better performance than all the other approaches when the tangent distance features are used. In the case of histogram distances, there does not seem to be a clear-cut winner. While the Erlang prior leads to the smallest number of false positives, the fixed threshold leads to the smallest number of misses. The Weibull prior ranks second in each class, providing a good compromise between the two types of errors. Overall, one can conclude that the Weibull prior has the best performance in terms of the tradeoff between false positives and misses.

One final issue of practical concern is the robustness of the segmentation against inaccuracies in the estimates of the prior parameters. This is particularly relevant in the Weibull case since, as discussed in Section III-B, it leads to a threshold that is unbounded and goes to −∞ for large τ. It is thus possible that poor parameter estimates may lead to a threshold that decays too quickly, originating too many false positives. Because we are relying on a robust estimate of the β parameter, the estimates of this parameter will be, by definition, insensitive to outliers. It remains to investigate the effect of poor estimates of the α parameter on the segmentation performance. For this, we have conducted a series of experiments where the performance was evaluated for predetermined values of α ranging from 1 to 4 with intervals of 0.2, on the histogram distance feature set.

Fig. 9 presents the results of this experiment. It is clear that, over a large range of α, the performance attained with the Weibull prior is superior to that of the fixed threshold, and over a significant range it is very close to the optimal. Furthermore, both the number of false positives and misses are also approximately constant over a significant range of α.

The reasons for the improved performance of Bayesian segmentation are illustrated by Fig. 10, which presents the evolution of the thresholding process for one of the trailers in the database (“blankman”). Two thresholding approaches are depicted: the one derived from the Bayesian formulation with the Weibull prior, and standard fixed thresholding. The activity features are, in both cases, histogram distances.

Notice that the adaptive behavior of the Bayesian threshold significantly increases the robustness against spurious peaks of the activity metric originated by events such as very fast motion, explosions, camera flashes, etc. This is visible, for example, in the shot between frames 772 and 794, which depicts a scene composed of fast moving black and white graphics where figure and ground are frequently reversed. While the Bayesian procedure originates a single false positive, fixed thresholding produces several. This is despite the fact that we have complemented the plain fixed threshold with commonly used heuristics, such as rejecting sequences of consecutive shot boundaries


Fig. 7. Total error as a function of the number of model components. Left: histogram distance. Right: tangent distance. From top to bottom: Poisson, Erlang, and Weibull priors. Errors are shown as a function of the number of Gaussians in the scene change activity model given the number of Erlang components in the activity model for regular frames. “Fixed” is the best performance achieved with a static threshold.

[6]. The vanilla fixed threshold would originate many more errors. Fig. 11 presents a few frames from this shot, illustrating the type of motion and the intensity variations contained in it.


Fig. 8. Total number of errors, false positives, and missed boundaries for the various approaches. Top: histogram distance. Bottom: tangent distance.

VII. SEMANTIC CHARACTERIZATION

We have argued in Sections I and IV that a coarse semantic characterization can be achieved by mapping each video stream in a video library into a two-dimensional (2-D) feature space capturing the average shot duration and activity. In order to evaluate the characterization ability of this transformation, we applied it to the trailer library introduced above.

Fig. 12 shows how the movies populate the feature space. The features were obtained by segmenting the video into shots and simply measuring the average duration of each shot and the average value of the activity feature for the regular frames in the shot. Results are presented for the two activity metrics discussed in the previous sections and were normalized to [0, 1] by dividing by the maximum value along each axis. The segmentations that ranked best in the previous section were the ones employed. We also performed a search in the Internet Movie Database (IMDB) [1] for the genre assigned to each movie by the Motion Picture Association of America. Three major classes were identified: romance/comedy, action, and other (which includes horror, drama, and adventure). There were not enough points in the movie sample to further subdivide the other class
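Computing these coordinates from a segmentation is straightforward; a minimal sketch follows, with shot durations and per-frame activity values assumed precomputed, one list per movie.

```python
# Sketch of the 2-D feature of Section VII (assumes numpy).
import numpy as np

def movie_features(shot_durations, shot_activities):
    """One (average shot duration, average within-shot activity) point per movie."""
    avg_duration = float(np.mean(shot_durations))
    avg_activity = float(np.mean([np.mean(a) for a in shot_activities]))
    return avg_duration, avg_activity

def normalize(points):
    """Normalize a set of (duration, activity) points to [0, 1] per axis."""
    pts = np.asarray(points, dtype=float)
    return pts / pts.max(axis=0)      # divide by the maximum along each axis
```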

Fig. 9. Total number of errors, false positives, and missed boundaries as a function of α for the Weibull prior and the histogram distance features. The straight line is the best performance achieved with a fixed threshold.

in a meaningful way. The genre classes are indicated in the plots by the symbol used to represent each movie.

While there are not enough points in the sample to draw definitive conclusions, several interesting observations can be made from the figure. First, while there are differences in the details, the overall behavior is the same for the two graphs. In particular, the points seem to obey a law of the type length × activity = constant. This is particularly interesting because the existence of a related law, character × action = constant, has been postulated in the film theory literature [29]. This seems to 1) confirm the fact that the metrics of activity on which we have relied are a good indicator for the action content of a movie and 2) indicate that the shot length is a good metric for the amount of character in a movie. The latter relationship is intuitive, since the construction of complex characters requires extensive use of dialogue, and this is inherently lengthy.

Second, there seems to be a clear separation between the three semantic classes in the activity/length feature space. In particular, movies of the romance and comedy genres are mostly above the top dashed line, action movies below the bottom one, and the other genres in between. There are only two movies that consistently violate these rules: “jungle” and “madness.” Further investigation reveals the reason for these outliers: while the romances above the top line either belong to the category drama/romance or comedy/romance, “jungle” is categorized as adventure/romance, indicating a degree of action that is unusual for movies in the romance class. On the other hand, while “madness” is assigned to the horror genre, it is full of action-packed scenes. More samples from the horror class would be necessary for a deeper analysis of the interplay between these two genres.

In addition to these two consistent outliers, there are two additional ones in each graph. When the histogram distance is used as activity feature, both “riverwild” (action) and “scout” (drama) incorrectly appear in the romance/comedy region. “riverwild” is a good example of why the histogram distance might fail as a metric of activity. It is an action movie whose plot revolves around white-water rafting and contains numerous


Fig. 10. Evolution of the thresholding process for a challenging trailer. Top: Bayesian segmentation. The likelihood ratio and the Weibull threshold are displayed. Bottom: fixed thresholding. The histogram distances and the optimal threshold (determined by the leave-one-out method using the other trailers in the database) are presented. In both graphs, missed boundaries are signaled by stars, while false positives are indicated by circles.

shots depicting this sport. While these shots exhibit a significant amount of motion from frame to frame, the color histograms tend not to change too much, because there is always plenty of water in the background. The action content cannot, therefore,


Fig. 11. Six frames from a shot in the database displayed in raster-scan order. Notice the fast motion and the reversals of foreground/background color.

be captured well by the histogram distance. Since, as discussed in Section IV, the tangent distance performs significantly better in these situations, it is not surprising that the movie is correctly classified when the latter is used as the activity metric.

In fact, the outliers created by the tangent distance are much more intuitive than those originated by the histogram distance. In addition to “jungle” and “madness,” it misplaces the comedies “blankman” and “edwood.” However, while the comedies above the top dashed line are typically categorized as comedy/romance or simply comedy, “edwood” receives the awkward categorization of comedy/drama (indicating that characterizing its content is probably a difficult task), and “blankman” that of comedy/screwball/super hero, confirming the fact that it is an action-packed comedy, which could easily fall in the action category. Thus, while, strictly speaking, the placement of these movies in the action and other classes is incorrect, it is semantically plausible.

In summary, it appears that the lines superimposed on the shot length/activity space are likely to have a semantic meaning and that this space is one where the movies are nicely laid out according to the degree of action in their scripts, providing discrimination between several genres of content. This property could prove useful for applications where a coarse

characterization of the content into semantic classes can be used as a quick filtering mechanism, allowing users to rapidly eliminate items in which they clearly have no interest. For example, a graph such as the ones above could be used as a graphical front-end to a movie library, providing an intuitive way for the identification of movies from different genres. On the other hand, the semantic characterization could also be used to provide constraints for a query-by-example content-based retrieval system. This has in fact been recently reported in [23], where a two-way classification of shots into action and character classes, based on the features introduced above, is combined with a more traditional metric of visual similarity for the purposes of retrieval. It is shown that the inclusion of semantic constraints improves the retrieval accuracy.

VIII. DISCUSSION

In this paper, we have argued that the analysis of video structure plays an important role in its semantic understanding. In particular, knowledge about structure can be used both as a means to improve low-level video processing tasks and to extract features that convey semantic information.



Fig. 12. Population of the feature space by the movies in our database. Top: histogram distance. Bottom: tangent distance. Movie names are listed in Table I.

We have presented statistical models for two important aspects of video structure and used them as the basis for a Bayesian formulation of shot segmentation. This formulation was shown to extend the standard thresholding model in an adaptive and intuitive way, leading to improved shot segmentation. By applying the feature transformation required for Bayesian segmentation to a database of movie clips, we have illustrated how the Bayesian model captures important semantic properties of the content that can be useful for content-based access to movie libraries.

There are several issues in semantic characterization that we did not address here. In particular, it would be interesting to



build a movie classifier based on the feature space of Fig. 12. Unfortunately, to have high confidence in the resulting classification rates, we would have to process hundreds or even thousands of movies, a requirement that is beyond the reach of our current computational resources. There are, however, questions that will be feasible to study in the near term. First, it would be interesting to extend the segmentation framework presented here to the more difficult topic of scene segmentation. While, in principle, the framework is still valid, it remains to be seen if the models now proposed can also account for scene duration, if there are better alternatives, what features should be used to detect scene boundaries, and how the framework would compare against other methods presented in the literature [25], [55]. Second, it would be interesting to develop models for elements of mise-en-scene other than activity. While there has been a significant amount of work in the machine vision, image processing, and content-based retrieval communities toward the development of modules capable of extracting semantic information from images [16], [44], [46], a much smaller amount of work has been devoted to the development of a framework for the integration of those modules into a complete retrieval system. We have recently argued that Bayesian inference is the appropriate tool for this integration [26], [51], and are currently investigating the use of Bayesian principles to achieve more extensive semantic characterization.

APPENDIX A
PROPERTIES OF THE ERLANG DISTRIBUTION

In this appendix, we derive expressions for the ML estimation of the parameters of the Erlang distribution from a set of training data, and the integral of its density function over an interval. We start by deriving, from (1), the log-likelihood of the Erlang density

\[
\log \mathcal{E}_{\kappa,\lambda}(\tau) = \kappa \log\lambda + (\kappa - 1)\log\tau - \lambda\tau - \log\,(\kappa - 1)!
\tag{17}
\]

Given a sample \(\{\tau_1, \ldots, \tau_N\}\) of \(N\) independent Erlang random variables, the ML parameter estimates are obtained from

\[
(\kappa^*, \lambda^*) = \arg\max_{\kappa,\lambda} \sum_{i=1}^{N} \log \mathcal{E}_{\kappa,\lambda}(\tau_i).
\tag{18}
\]

This is a maximization problem over both discrete (\(\kappa\)) and continuous (\(\lambda\)) variables, and such problems are typically hard to solve. However, because it is a sum of \(\kappa\) independent interarrival times of a Poisson process, by the central limit theorem [14], the Erlang density can be approximated by a Gaussian for large \(\kappa\). Therefore, one can either rely on the Gaussian approximation or assume that the range of interest of \(\kappa\) is small. Since in all our experiments we have found that the histograms for shot duration are never close to those of a Gaussian random variable, we have indeed relied on this assumption.

For small \(\kappa\), the maximization becomes significantly simpler, since it is possible to do an exhaustive search over all \(\kappa\). For this:

1) select a range for \(\kappa\). In our experiments, we have used \(\kappa\) between one and ten;

2) for each \(\kappa\), find the \(\lambda\) that maximizes (18);

3) select the value of \(\kappa\) that leads to the largest overall log-likelihood.

Given \(\kappa\), the maximization of (18) is straightforward. Substituting (17) in (18), taking the derivative with respect to \(\lambda\), and setting it to zero, we obtain

\[
\lambda^* = \frac{\kappa}{\frac{1}{N}\sum_{i=1}^{N} \tau_i}.
\tag{19}
\]
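For concreteness, the search procedure above can be written in a few lines of code. The following Python sketch is not the authors' implementation; the function names are ours, and it assumes the Erlang parameterization of (1), with shape \(\kappa\) and rate \(\lambda\):

```python
import numpy as np
from scipy.special import gammaln  # gammaln(k) = log((k-1)!)

def erlang_log_likelihood(tau, kappa, lam):
    """Log-likelihood of (17), summed over the sample tau."""
    return np.sum(kappa * np.log(lam) + (kappa - 1) * np.log(tau)
                  - lam * tau - gammaln(kappa))

def fit_erlang_ml(tau, max_kappa=10):
    """ML fit by exhaustive search over kappa = 1, ..., max_kappa;
    for each kappa, lambda is given in closed form by (19)."""
    best_kappa, best_lam, best_ll = None, None, -np.inf
    for kappa in range(1, max_kappa + 1):
        lam = kappa / np.mean(tau)  # (19)
        ll = erlang_log_likelihood(tau, kappa, lam)
        if ll > best_ll:
            best_kappa, best_lam, best_ll = kappa, lam, ll
    return best_kappa, best_lam
```

Given an array tau of observed shot durations, fit_erlang_ml(tau) returns the \((\kappa, \lambda)\) pair with the largest log-likelihood over the restricted range of \(\kappa\).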

We next consider the evaluation of

\[
\int_{a}^{b} \mathcal{E}_{\kappa,\lambda}(\tau)\, d\tau
\tag{20}
\]

and use integration by parts to obtain

\[
\int_{a}^{b} \mathcal{E}_{\kappa,\lambda}(\tau)\, d\tau = \left[ -\frac{(\lambda\tau)^{\kappa-1}}{(\kappa-1)!}\, e^{-\lambda\tau} \right]_{a}^{b} + \int_{a}^{b} \mathcal{E}_{\kappa-1,\lambda}(\tau)\, d\tau
\]

from which

\[
\int_{a}^{b} \mathcal{E}_{\kappa,\lambda}(\tau)\, d\tau - \int_{a}^{b} \mathcal{E}_{\kappa-1,\lambda}(\tau)\, d\tau = \frac{(\lambda a)^{\kappa-1}}{(\kappa-1)!}\, e^{-\lambda a} - \frac{(\lambda b)^{\kappa-1}}{(\kappa-1)!}\, e^{-\lambda b}.
\]

This equation can be solved recursively, leading to

\[
\int_{a}^{b} \mathcal{E}_{\kappa,\lambda}(\tau)\, d\tau = \sum_{n=0}^{\kappa-1} \left[ \frac{(\lambda a)^{n}}{n!}\, e^{-\lambda a} - \frac{(\lambda b)^{n}}{n!}\, e^{-\lambda b} \right].
\tag{21}
\]
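Once the parameters are estimated, (21) gives the probability of an interval in closed form. A minimal sketch, again with hypothetical function names:

```python
import numpy as np
from scipy.special import factorial

def erlang_interval_prob(a, b, kappa, lam):
    """P(a <= tau <= b) for an Erlang(kappa, lam) variable, per (21)."""
    n = np.arange(kappa)  # n = 0, ..., kappa - 1
    tail = lambda x: np.exp(-lam * x) * np.sum((lam * x) ** n / factorial(n))
    return tail(a) - tail(b)
```

Since the Erlang distribution is a Gamma distribution with integer shape, the result can be cross-checked against scipy.stats.gamma(kappa, scale=1/lam).cdf(b) minus the corresponding value at a.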

APPENDIX B
PROPERTIES OF THE WEIBULL DISTRIBUTION

In this appendix, we derive expressions for the ML estimation of the parameters of the Weibull distribution from a set of training data, and the integral of its density function over an interval. We start by deriving, from (2), the log-likelihood of the Weibull density

\[
\log \mathcal{W}_{\alpha,\lambda}(\tau) = \log\alpha + (\alpha - 1)\log\tau - \alpha\log\lambda - \left(\frac{\tau}{\lambda}\right)^{\alpha}.
\tag{22}
\]

Given a sample \(\{\tau_1, \ldots, \tau_N\}\) of \(N\) independent Weibull random variables, the ML parameter estimates are obtained from

\[
(\alpha^*, \lambda^*) = \arg\max_{\alpha,\lambda} \sum_{i=1}^{N} \log \mathcal{W}_{\alpha,\lambda}(\tau_i).
\tag{23}
\]

Taking derivatives with respect to \(\alpha\) and \(\lambda\), we obtain



\[
\frac{\partial}{\partial\alpha} \sum_{i=1}^{N} \log \mathcal{W}_{\alpha,\lambda}(\tau_i) = \frac{N}{\alpha} + \sum_{i=1}^{N} \log\frac{\tau_i}{\lambda} - \sum_{i=1}^{N} \left(\frac{\tau_i}{\lambda}\right)^{\alpha} \log\frac{\tau_i}{\lambda}
\]
\[
\frac{\partial}{\partial\lambda} \sum_{i=1}^{N} \log \mathcal{W}_{\alpha,\lambda}(\tau_i) = -\frac{N\alpha}{\lambda} + \frac{\alpha}{\lambda^{\alpha+1}} \sum_{i=1}^{N} \tau_i^{\alpha}.
\tag{24}
\]

Setting the derivatives to zero leads to a system of equations that cannot be solved in closed form. The maximization can, however, still be performed numerically by relying on gradient descent techniques such as Newton's method [4]. In our experiments, we adopted a simpler variation of this process that consisted of restricting the parameter \(\alpha\) to be a multiple of 0.5. This variation was inspired by the method introduced above to deal with the Erlang case and can be solved in similar fashion. Namely, we find the optimal value for \(\lambda\) assuming that \(\alpha\) is known, and then perform an exhaustive search over \(\alpha\). Given \(\alpha\), the optimal \(\lambda\) is the one that sets (24) to zero, i.e.,

\[
\lambda^* = \left( \frac{1}{N} \sum_{i=1}^{N} \tau_i^{\alpha} \right)^{1/\alpha}.
\tag{25}
\]

We can thus see that the optimal \(\lambda\) is a function of the sample mean of the \(\alpha\)th power of the sample values. It is well known in the statistics literature [38] that the sample mean is very sensitive to the presence of outliers in the data. More robust estimates can be achieved by replacing the sample mean by the sample median, leading to

\[
\lambda^* = \left[ \operatorname*{median}_{i} \left( \tau_i^{\alpha} \right) \right]^{1/\alpha} = \operatorname*{median}_{i} \left( \tau_i \right).
\tag{26}
\]
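The restricted search can be sketched in the same style as the Erlang case. This is only an illustration under our assumptions: the grid of \(\alpha\) values (here 0.5 to 5 in steps of 0.5) is our choice, and the robust flag switches between (25) and (26):

```python
import numpy as np

def weibull_log_likelihood(tau, alpha, lam):
    """Log-likelihood of (22), summed over the sample tau."""
    return np.sum(np.log(alpha) + (alpha - 1) * np.log(tau)
                  - alpha * np.log(lam) - (tau / lam) ** alpha)

def fit_weibull_ml(tau, alpha_grid=np.arange(0.5, 5.5, 0.5), robust=False):
    """ML fit with alpha restricted to multiples of 0.5; for each alpha,
    lambda comes from (25), or from the median-based estimate (26)."""
    best_alpha, best_lam, best_ll = None, None, -np.inf
    for alpha in alpha_grid:
        if robust:
            lam = np.median(tau)                          # (26)
        else:
            lam = np.mean(tau ** alpha) ** (1.0 / alpha)  # (25)
        ll = weibull_log_likelihood(tau, alpha, lam)
        if ll > best_ll:
            best_alpha, best_lam, best_ll = alpha, lam, ll
    return best_alpha, best_lam
```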

We next consider the evaluation of

\[
\int_{a}^{b} \mathcal{W}_{\alpha,\lambda}(\tau)\, d\tau.
\]

Using the change of variable

\[
u = \left( \frac{\tau}{\lambda} \right)^{\alpha}
\]

it is straightforward to show that

\[
\int_{a}^{b} \mathcal{W}_{\alpha,\lambda}(\tau)\, d\tau = e^{-(a/\lambda)^{\alpha}} - e^{-(b/\lambda)^{\alpha}}.
\tag{27}
\]
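As with the Erlang case, (27) translates directly into code; a one-function sketch (hypothetical name):

```python
import numpy as np

def weibull_interval_prob(a, b, alpha, lam):
    """P(a <= tau <= b) for a Weibull(alpha, lam) variable, per (27)."""
    return np.exp(-(a / lam) ** alpha) - np.exp(-(b / lam) ** alpha)
```

This closed form is part of what makes the Weibull prior convenient for computing the adaptive threshold displayed in Fig. 10.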

ACKNOWLEDGMENT

The authors would like to thank G. Iyengar for several stimulating discussions about this work, and the two anonymous reviewers for their comments that led to improvements in this paper.

REFERENCES

[1] Internet Movie Database, http://us.imdb.com/.

[2] P. Anandan, J. Bergen, K. Hanna, and R. Hingorani, "Hierarchical model-based motion estimation," in Motion Analysis and Image Sequence Processing, M. Sezan and R. Lagendijk, Eds. Boston, MA: Kluwer, 1993, ch. 1.

[3] Y. Ariki and Y. Saito, "Extraction of TV news articles based on scene cut detection using DCT clustering," in IEEE Int. Conf. Image Processing, Lausanne, Switzerland, 1996, pp. 847–850.

[4] D. Bertsekas, Nonlinear Programming. Belmont, MA: Athena Scientific, 1995.

[5] D. Bordwell and K. Thompson, Film Art: An Introduction. New York: McGraw-Hill, 1986.

[6] J. Boreczky and L. Rowe, "Comparison of video shot boundary detection techniques," in Proc. SPIE Conf. Visual Communication and Image Processing, 1996.

[7] V. Bove, J. Dakss, S. Agamanolis, and E. Chalom, "Adding hyperlinks to digital television," in Proc. SMPTE 140th Technical Conf., 1998.

[8] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Region-based image querying," in Workshop on Content-Based Access to Image and Video Libraries, San Juan, PR, 1997, pp. 42–49.

[9] R. Castagno, T. Ebrahimi, and M. Kunt, "Video segmentation based on multiple features for interactive multimedia applications," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 562–571, Sept. 1998.

[10] E. Chalom, "Statistical image sequence segmentation using multidimensional attributes," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, 1998.

[11] S. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong, "A fully automated content-based video search engine supporting spatiotemporal queries," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 602–615, Sept. 1998.

[12] A. Dempster, N. Laird, and D. Rubin, "Maximum-likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc. B, vol. 39, pp. 1–22, 1977.

[13] Y. Deng and B. Manjunath, "NeTra-V: Toward an object-based video representation," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 616–627, Sept. 1998.

[14] A. Drake, Fundamentals of Applied Probability Theory. New York: McGraw-Hill, 1987.

[15] W. Niblack et al., "The QBIC project: Querying images by content using color, texture, and shape," in Proc. Conf. Storage and Retrieval for Image and Video Databases, San Jose, CA, Feb. 1993, pp. 173–181.

[16] D. Forsyth and M. Fleck, "Body plans," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, San Juan, PR, 1997, pp. 678–683.

[17] B. Funt and G. Finlayson, "Color constant color indexing," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 522–529, May 1995.

[18] U. Gargi, R. Kasturi, and S. Antani, "Performance characterization and comparison of video indexing algorithms," in IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, 1997, pp. 559–565.

[19] A. Gelman, J. Carlin, H. Stern, and D. Rubin, Bayesian Data Analysis. London, U.K.: Chapman & Hall, 1995.

[20] B. Gunsel and M. Tekalp, "Content-based video abstraction," in IEEE Int. Conf. Image Processing, Chicago, IL, 1998, pp. 128–132.

[21] A. Hanjalic, R. Lagendijk, and J. Biemond, "Template-based detection of anchorperson shots in news programs," in IEEE Int. Conf. Image Processing, Chicago, IL, 1998, pp. 148–152.

[22] R. Hogg and E. Tanis, Probability and Statistical Inference. New York: Macmillan, 1993.

[23] G. Iyengar and A. Lippman, "Semantically controlled content-based retrieval of video sequences," in Proc. SPIE Conf. Multimedia Storage and Archiving Systems III, Boston, MA, 1998, pp. 216–227.

[24] A. Jain and A. Vailaya, "Image retrieval using color and shape," Pattern Recognit., vol. 29, pp. 1233–1244, Aug. 1996.

[25] J. Kender and B. Yeo, "Video scene segmentation via continuous video coherence," in IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, 1998, pp. 367–373.

[26] A. Lippman, N. Vasconcelos, and G. Iyengar, "Human interfaces to video," in 32nd Asilomar Conf. Signals, Systems, and Computers, Asilomar, CA, 1998.



[27] C. Low, Q. Tian, and H. Zhang, "An automatic news video parsing, indexing, and browsing system," in Proc. ACM Multimedia Conf., Boston, MA, 1996.

[28] W. Ma and H. Zhang, "Benchmarking of image features for content-based retrieval," in 32nd Asilomar Conf. Signals, Systems, and Computers, Asilomar, CA, 1998.

[29] W. Martin, Recent Theories of Narrative. Ithaca, NY: Cornell Univ. Press, 1986, ch. 5.

[30] J. Monaco, How to Read a Film. Oxford, U.K.: Oxford Univ. Press, 1981.

[31] A. Nagasaka and Y. Tanaka, "Automatic video indexing and full-video search for object appearances," in Visual Database Systems II, E. Knuth and L. Wegner, Eds. Amsterdam, The Netherlands: Elsevier, 1992, pp. 113–127.

[32] W. Niblack et al., "The QBIC project: Querying images by content using color, texture, and shape," in Storage and Retrieval for Image and Video Databases II, San Jose, CA, Feb. 1993, pp. 173–181.

[33] G. Pass, R. Zabih, and J. Miller, "Comparing images using color coherence vectors," in Proc. ACM Int. Multimedia Conf., Nov. 1996, pp. 65–73.

[34] A. Pentland, R. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," Int. J. Comput. Vis., vol. 18, pp. 233–254, June 1996.

[35] L. Rabiner and B. Juang, Eds., Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[36] R. Redner and H. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Rev., vol. 26, pp. 195–239, Apr. 1984.

[37] K. Reisz and G. Millar, The Technique of Film Editing. Oxford, U.K.: Focal Press, 1968.

[38] P. Rousseeuw and A. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 1987.

[39] P. Salembier and M. Pardas, "Hierarchical morphological segmentation for image sequence coding," IEEE Trans. Image Processing, vol. 3, pp. 639–651, Sept. 1994.

[40] P. Simard, Y. Le Cun, and J. Denker, "Efficient pattern recognition using a new transformation distance," in Proc. Neural Information Processing Systems, Denver, CO, 1994.

[41] J. Smith and S. Chang, "VisualSEEk: A fully automated content-based image query system," in ACM Multimedia, Boston, MA, 1996, pp. 87–98.

[42] S. Smoliar and H. Zhang, "Video indexing and retrieval," in Multimedia Systems and Techniques, B. Furht, Ed. Boston, MA: KAP, 1996.

[43] M. Swain and D. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, pp. 11–32, 1991.

[44] M. Szummer and R. Picard, "Indoor-outdoor image classification," in Proc. Workshop on Content-Based Access to Image and Video Databases, Bombay, India, 1998.

[45] D. Titterington, A. Smith, and U. Makov, Statistical Analysis of Finite Mixture Distributions. New York: Wiley, 1985.

[46] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cogn. Neurosci., vol. 3, pp. 71–86, 1991.

[47] N. Vasconcelos and A. Lippman, "A Bayesian video modeling framework for shot segmentation and content characterization," in Proc. IEEE Workshop on Content-Based Access to Image and Video Libraries, San Juan, PR, 1997.

[48] ——, "Multiresolution tangent distance for affine invariant classification," in Proc. Conf. Neural Information Processing Systems, Denver, CO, 1997.

[49] ——, "Empirical Bayesian EM-based motion segmentation," in Proc. IEEE Computer Vision and Pattern Recognition Conf., San Juan, PR, 1997.

[50] ——, "Toward semantically meaningful feature spaces for the characterization of video content," in Proc. Int. Conf. Image Processing, Santa Barbara, CA, 1997.

[51] ——, "A Bayesian framework for semantic content characterization," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, 1998.

[52] J. Wang and E. Adelson, "Representing moving images with layers," IEEE Trans. Image Processing, vol. 3, pp. 625–638, Sept. 1994.

[53] Y. Weiss, "Smoothness in layers: Motion segmentation using nonparametric mixture estimation," in Proc. Conf. Computer Vision and Pattern Recognition, San Juan, PR, 1997.

[54] B. Yeo and B. Liu, "Rapid scene analysis on compressed video," IEEE Trans. Circuits Syst. Video Technol., vol. 5, pp. 533–544, Dec. 1995.

[55] M. Yeung and B. Yeo, "Time-constrained clustering for segmentation of video into story units," in ICPR'96, 1996, pp. 375–380.

[56] H. Zhang, Y. Gong, S. Smoliar, and S. Tan, "Automatic parsing of news video," in Proc. Int. Conf. Multimedia Computing and Systems, Boston, MA, May 1994.

[57] H. Zhang, S. Smoliar, and J. Wu, "Content-based video browsing tools," in Proc. Symp. Electronic Imaging Science and Technology: Multimedia Computing and Networking, A. Rodriguez and J. Maitan, Eds., San Jose, CA: SPIE, Feb. 1995, vol. 2417, pp. 389–398.

[58] H. Zhang, J. Wang, and Y. Altunbasak, "Content-based video retrieval and compression: A unified solution," in Proc. IEEE Int. Conf. Image Processing, Santa Barbara, CA, 1997, pp. 13–16.

[59] J. Zhang, J. Modestino, and D. Langan, "Maximum-likelihood parameter estimation for unsupervised stochastic model-based image segmentation," IEEE Trans. Image Processing, vol. 3, pp. 404–420, July 1994.

Nuno Vasconcelos (S'92) received the licenciatura in electrical engineering and computer science from the Universidade do Porto, Portugal, in 1988, and the M.S. degree from the Massachusetts Institute of Technology (MIT), Cambridge, in 1993. He is now pursuing the Ph.D. degree at the MIT Media Laboratory, where he has been a Research Assistant since 1991.

During this period, he has worked on several topics, including video analysis and compression, motion estimation and segmentation, image representations, content-based image retrieval, semantic characterization of video content, and statistical modeling of visual data. His current interests are in the areas of image processing, machine vision, statistical learning, and multimedia.

Andrew Lippman (M'78) received the B.S. and M.S. degrees in electrical engineering from the Massachusetts Institute of Technology (MIT), Cambridge, in 1971 and 1978, respectively, and the Ph.D. degree from the Swiss Federal Institute of Technology, Lausanne, Switzerland, in 1995.

He is currently the Associate Director of the MIT Media Laboratory and a Lecturer at MIT, where he has directed research programs on image processing, personal computers, entertainment, and graphics. Recently, he has established and directs a research consortium entitled "Digital Life" that addresses bits, people, and community in a wired world. He holds seven patents in television and digital image processing, and has published widely in both the technical and lay literature. His current interests are in the design of flexible, interactive digital television infrastructures.