
Lecture Notes: Multimedia Information Systems

Marcel Worring
Intelligent Sensory Information Systems, University of Amsterdam

[email protected]
http://www.science.uva.nl/~worring

tel 020 5257521


Contents

1 Introduction

2 Data, domains, and applications
  2.1 Data and applications
  2.2 Categorization of data
  2.3 Data acquisition
  2.4 Ontology creation
  2.5 Produced video documents
    2.5.1 Semantic index
    2.5.2 Content
    2.5.3 Recording the scene
    2.5.4 Layout
  2.6 Discussion

3 Data representation
  3.1 Categorization of multimedia descriptions
  3.2 Basic notions
  3.3 Audio representation
    3.3.1 Audio features
    3.3.2 Audio segmentation
    3.3.3 Temporal relations
  3.4 Image representation
    3.4.1 Image features
    3.4.2 Image segmentation
    3.4.3 Spatial relations
  3.5 Video representation
    3.5.1 Video features
    3.5.2 Video segmentation
    3.5.3 Spatio-temporal relations
  3.6 Text representation
    3.6.1 Text features
    3.6.2 Text segmentation
    3.6.3 Textual relations
  3.7 Other data

4 Similarity
  4.1 Perceptual similarity
  4.2 Layout similarity
  4.3 Semantic similarity

5 Multimedia Indexing Tools
  5.1 Introduction
  5.2 Pattern recognition methods
  5.3 Statistical methods
  5.4 Dimensionality reduction
  5.5 Invariance

6 Video Indexing
  6.1 Introduction
  6.2 Video document segmentation
    6.2.1 Layout reconstruction
    6.2.2 Content segmentation
  6.3 Multimodal analysis
    6.3.1 Conversion
    6.3.2 Integration
  6.4 Semantic video indexes
    6.4.1 Genre
    6.4.2 Sub-genre
    6.4.3 Logical units
    6.4.4 Named events
    6.4.5 Discussion
  6.5 Conclusion

7 Data mining
  7.1 Classification learning
  7.2 Association learning
  7.3 Clustering
  7.4 Numeric prediction

8 Information visualization
  8.1 Introduction
  8.2 Visualizing video content

9 Interactive Retrieval
  9.1 Introduction
  9.2 User goals
  9.3 Query Space: definition
    9.3.1 Query Space: data items
    9.3.2 Query Space: features
    9.3.3 Query Space: similarity
    9.3.4 Query Space: interpretations
  9.4 Query Space: processing steps
    9.4.1 Initialization
    9.4.2 Query specification
    9.4.3 Query output
    9.4.4 Query output evaluation
  9.5 Query space interaction
    9.5.1 Interaction goals
    9.5.2 Query space display
    9.5.3 User feedback
    9.5.4 Evaluation of the interaction process
  9.6 Discussion and conclusion

10 Data export
  10.1 Introduction
  10.2 MPEG-7
  10.3 SMIL: Synchronized Multimedia Integration Language


Chapter 1

Introduction


A staggering amount of digital data circulates in the world, brought to you by television or the Internet, exchanged as communication between businesses, or acquired through various sensors like your digital (video) camera. Increasingly, this digital data stream is composed of multimedia data items, i.e. a combination of pictorial, auditory, and textual data.

With the increasing capabilities of computers, processing the multimedia data stream for further exploitation will become common practice. Processors are still expected to double in speed every 18 months for the foreseeable future; memory sizes will double every 12 months and grow to terabytes of storage; and network bandwidth will improve significantly every 9 months, increasing these resources even faster. So there is ample capacity to fill with multimedia data streams.

Observing how the Internet is evolving, it is obvious that users prefer a multimedia style of information exchange and interaction, including pictures, video, and sound. We foresee that the same turnover will occur in professional life. The emphasis in information handling will gradually shift from categorical and numerical information to information in multimedia form: pictorial, auditory, and free text data. So, multimedia is the message [after McLuhan's phrase of the sixties that the medium is the message, predicting the effect television would have on society].

Multimedia information - when offered as an option with the same accessibility as the current information forms - will be the preferred format, as it is much closer to human forms of communication. Multimedia information presentation from the system to the user, and interpretation from the user to the system, employs our coordinated communication skills, enabling us to control system interactions in a more transparent experience than ever before. More natural interaction with information and communication services for a wide range of users will generate significant societal and economic value. Hence, it is safe to predict that multimedia will enter all information systems: in business-to-business communication, on the professional desk, in communication with clients, in education, in science and in the arts, and in broadcasting via the Internet. Each of these worlds will eventually pose its own questions, but for now a set of tools is needed as keys to unlock this future.

Multimedia may become the natural form of communication; however, it does not come entirely for free. The semantics of a message is much more immediate for categorical and numeric information than for the bit stream of pictures and sounds. What a picture shows, what is meant by a text, and what is said in an audio signal are not easy to deduce with a computer. These semantics need to be conquered from the raw data by structuring, transformation, analysis, and classification.

1 This introduction is an adaptation of the introduction of Smeulders et al. in the “BSIK proposal MultimediaN”.


Most of the multimedia information digitally available is irrelevant to the user, but some tiny fraction hidden in the information flood is highly relevant. The challenge is to locate these nuggets of multimedia experiences or knowledge. Much of the information is directed to an anonymous receiver, many pieces just sit there waiting to be accessed, and some valuable information is hidden in the stream of information passing by.

Multimedia in its full exploitation requires the handling of video, pictures, audio, and language. It includes man-machine interaction in multi-modal form. In addition, knowledge and intelligence about multimedia forms will lead to insights into the multimedia information flows. Truly multimedia information systems will hold and provide access to the ever increasing data streams, handling information in all its formats. In this context, multimedia can be described as all handling in a computer of information in pictorial, free language, and audio form.

Some obvious application patterns of multimedia are consumer electronics devices, the archiving of large digital multimedia repositories, e-document handling of multimedia, digital delivery of information, digital publishing, and multimedia in education, e-health, and e-government. Multimedia handling is becoming an integral part of workflows in many economic clusters. Integrated multimedia information analysis will be used for surveillance, forensic evidence gathering, and public safety. Multimedia knowledge generation and knowledge engineering are employed for making picture, film, and audio archives accessible in cultural heritage and the entertainment industry. Multimedia information systems and standards will be used to efficiently store large piles of information: video from the laboratories, wearable video from official observations, and ultimately our own senses to become our autobiographic memory. Indexing and annotating multimedia to gain access to its semantics remains an important topic for anyone with a large pile of information. Multimedia presentation will replace the current graphical user interfaces for a much more effective information transfer, and so on.

Four urgent themes in multimedia computing

It is clear from our everyday life that information is overly present and yet under-employed. This also holds for digital multimedia information. The old idiom of information handling, largely settled by paper, the tape recorder, or the television, is simply not good enough to be translated straightforwardly into the digital age. The following four main themes arise at the societal level.

Information handling is frequently fragmented by the distinctly different information types of former days: image by pictures and maps, sound by recordings, and language by text. What is needed is to use the potential of the digital age to combine information of different kinds: images, sound, text, speech, maps, and video. Information integration across types is needed in the analysis stage, in the keeping of information in a storage system, and in the presentation and interaction with the user, each time using the most appropriate information channel for the circumstances.

Much of the information is not easily accessible, at least by the content of multimedia information forms. Although we have learned to deal with all forms of information instantaneously, such knowledge is not readily transferable to computer handling of information. Gaining access to the meaning of pictures, sounds, speech and language is not a trivial task for a digital computer. It is hard to learn from the diversity of information flows without seriously studying methods of multimedia learning.

The information reaching the user is inflexibly packaged in forms not very different from the one-modal communication of paper and records. In addition, the information comes in enormous quantities, effectively reducing our ability to handle the information stream with care. What is needed is information filtering, information exploration, and effective interaction. This can be achieved by personalized and adaptive information delivery and by information condensation.

The digital multimedia information is frequently poorly organized. This is due to many different causes, ancient and new ones, such as the barriers that standards create, the heterogeneity between formal systems of terminology, the inflexibility of information systems, the variable capacity of communication channels, and so on. What is needed is an integrated approach to break down the effective barriers of multimedia information handling systems.

Solutions

To provide solutions to the problems identified above a multi-disciplinary approach is required. To analyze the sensory and textual data, techniques from signal processing, speech recognition, computer vision, and natural language processing are needed. To understand the information, knowledge engineering and machine learning play an important role. For storing the tremendous amounts of data, database systems with high performance are required, and the same holds for the network technology involved. Apart from technology, the role of the user in multimedia is even more important than in traditional systems; hence visualization and human-computer interfacing form another fundamental basis of multimedia.

This course does not focus on any one of the underlying fields; rather, it aims to provide the necessary insight into these fields to be able to use them in building solutions through multimedia information systems. To that end we decompose the design of a multimedia information system into four modules:

1. Import module: this module is responsible for acquiring the data and storing it in the database with a set of relevant descriptors.

2. Data mining module: based on the descriptors this module aims at extracting new knowledge from the large set of data in the database.

3. Interaction module: this module deals with the end-user who has a specific search task, and includes the information visualization which helps the user in understanding the content of the information system.

4. Export module: this module deals with the delivery of information to other systems and external users.

Of course there is another important module for data storage, which should provide efficient access to large collections of data. For these lecture notes we will assume that such a database management system is available, but in fact this is a complete research field of its own.

In figure 1.1 an overview of the different modules and their interrelations is shown. This dataflow diagram is drawn conforming to the Gane-Sarson notation [41]. The diagrams consist of the following elements:

• Interfaces: these are drawn as rectangles, and denote an abstract interface to a process. Data can originate from an Interface.

• Processes: elements drawn as rounded rectangles. Each process must have both input and output dataflow(s).

• Datastores: drawn with open rectangles.


• Dataflows: drawn as arrows going from source to target.

Figure 1.1: Overall architecture of a Multimedia Information System, giving a blueprint for the lecture notes.

The Gane-Sarson method is a top-down method where each process can be detailed further, as long as the dataflows remain consistent with the overall architecture. In fact, in subsequent chapters the individual processing steps in the system are further detailed, and the dataflow pictures exhibit such consistency throughout the notes. Thus, the architecture gives a blueprint for the lecture notes.

Overview of the lecture notes

The lecture notes start off in chapter 2 by giving a general overview of different domains in which multimedia information systems are encountered and by considering the general steps involved in the process of putting data into the database. We consider the representation of the data, in particular for images, audio, video, and text, in chapter 3. Data analysis is then explored in chapter 5. We then move on to the specific domain of Video Information Systems and consider their indexing in chapter 6. The above modules comprise the import module. Having stored all data and descriptors in the database allows us to start the data mining module, which will be considered in chapter 7. The visualization part of the interaction module is considered in chapter 8, whereas the actual interaction with the user is described in chapter 9. Finally, in chapter 10 the data export module is explored.

Key terms

Every chapter ends with a set of key terms. Make sure that after reading each chapter you know the meaning of each of these key terms and you know the examples and methods associated with them, especially those which are explained in more detail during the lectures. Check whether you know the meaning after each lecture in which they were discussed, as subsequent lectures often assume these terms are known.

Enjoy the course,

Marcel

Keyterms in this chapter

Multimedia information, Gane-Sarson dataflow diagrams, data import, data export, interaction


Chapter 2

Data, domains, and applications

As indicated in the introduction, multimedia information is appearing in all kinds of applications in various domains. Here we focus on video documents because, in their richest form, they contain visual, auditory, and textual information. In this chapter we will consider how to analyze these domains and how to prepare the data for insertion into the database. To that end we first describe, in section 2.1, different domains and the way video data is produced and used. From there we categorize the data from the various applications in order to be able to select the right class of tools later (section 2.2). Then we proceed to the way the data is actually acquired in section 2.3. The role of external knowledge is considered in section 2.4. We then consider in detail how a video document is created, as this forms the basis for later indexing (section 2.5). Finally, we consider how this leads to a general framework which can be applied in different domains.

2.1 Data and applications

In the broadcasting industry the use of video documents is of course obvious. Large amounts of data are generated in the studios creating news, films, soaps, sport programs, and documentaries. In addition, their sponsors create a significant amount of material in the form of commercials. Storing the material produced in a multimedia information system allows the material to be reused later. Currently, one of the most important applications for the broadcasting industry is multi-channelling, i.e. distributing essentially the same information via television, the internet, and mobile devices. In addition, interactive television is slowly but steadily growing, e.g. allowing viewers to vote for their idol. More and more of the video documents in broadcasting are created digitally; however, the related information is still not distributed alongside the programs and hence not always available for storage in the information system.

Whereas the above video material is professionally produced, we also see a lot of unprofessional video being shot by people with their private video camera, their webcam, or more recently with their PDA or telephone equipped with a camera. The range of videos one can encounter here is very large, as cameras can be used for virtually everything. However, in practice, many of the videos will mostly contain people or holiday scenes. The major challenge in consumer applications is organizing the data in such a way that you can later find all the material you shot, e.g. by location, time, persons, or event.


Figure 2.1: Overview of the steps involved in importing data and domain knowledge into the database. DataRepresentation and DataAnalysis will be described in the next chapters; here we focus on the top part of the picture.


In education the use of video material is somewhat related to the creation of documentaries in broadcasting, but has added interactive possibilities. Furthermore, you see more and more lectures and scientific presentations being recorded with a camera and made accessible via the web. They can form the basis for new teaching material.

For businesses the use of electronic courses is an effective way of reducing the time needed for instruction. Next to this, videos are in many cases used for recording business meetings. Rather than taking extensive minutes of the meeting, action lists are sufficient, as the videos can replace the extensive written notes. Furthermore, it allows people who were not attending the meeting to understand the atmosphere in which certain decisions were made. Another business application field is the observation of customers. This is of great importance for marketing applications.

Let’s now move to the public sector, where, among other reasons but for a large part due to September 11, there has been an increased interest in surveillance applications guarding public areas and detecting specific people, riots, and the like. What is characteristic for these kinds of applications is that the interest is only in a very limited part of the video, but clearly one does not know beforehand which part contains the events of interest. Not only surveillance cameras, but also cameras on cash machines and the like provide the police with large amounts of material. In forensic analysis, video is therefore also becoming of great importance. Another application in the forensic field is the detection and identification of various videos found on PCs or the Internet containing material like child pornography or racism. Finally, video would be a good way of recording a crime scene for later analysis.

Somewhat related to surveillance is the use of video observation in health care. Think for example of a video camera in an old people’s home automatically identifying if someone falls down or has some other problem.

2.2 Categorization of data

Although all of the above applications use video, the nature of the videos is quite different for the different applications. We now give some categorizations to make it easier to understand the characteristics of the videos in the domain. A major distinction is between produced and observed video.

Definition 1 (Produced video data) videos that are created by an author who is actively selecting content and where the author has control over the appearance of the video.

A typical situation where the above is found is the broadcasting industry. Most of the programs are made according to a given format. The people and objects in the video are known and planned.

For analysis purposes it is important to further subdivide this category into three levels, depending on the stage of the process in which we receive the data:

• raw data: the material as it is shot.

• edited data: the material that is shown in the final program

• recorded data: the data as we receive it in our system

Edited data is the richest form of video as it has both content and a layout. When appropriate actions are taken directly when the video is produced, many indices can be stored directly with the data. In practice, however, this production info is often not stored. In that case recorded data becomes difficult to analyze, as things like layout information have to be reconstructed from the data.

The other major category is formed by:


Definition 2 (Observed video data) videos where a camera is recording some scene and where the author does not have means to manipulate or plan the content.

This kind of video is found in most of the applications; the most typical examples are surveillance videos and meetings. However, it is also found in broadcasting. E.g. in soccer videos the program as a whole is planned, but the content itself cannot be manipulated by the author of the document.

Two other factors concerning the data are important:

• quality of the data: what’s the resolution and signal-to-noise ratio of the data.

The quality of the video can vary significantly for the different applications, ranging from high-resolution, high quality videos in the broadcasting industry, to very low quality data from small cameras incorporated in mobile phones.

A final important factor is the

• Application control: how much control does one have over the circumstances under which the data is recorded.

Again the broadcasting industry provides the extreme cases here. E.g. in films the content is completely described in a script and lighting conditions are almost completely controlled, if needed enhanced using filters. In security applications the camera can be put at a fixed position which we know. For mobile phones the recording conditions are almost arbitrary.

Finally, a video is always shot for some reason. Following [59] we define:

• Purpose: the reason for which the video document is made, being entertainment, information, communication, or data analysis.

2.3 Data acquisition

A lot of video information is already recorded in digital form and hence can easily be transferred to the computer. If the video is shot in an analog way, a capture device has to be used in the computer to digitize the sensory data. In both cases the result is a stream of frames where each frame is typically an RGB image. A similar thing holds for audio. Next to the audiovisual data there can be a lot of accompanying textual information, like the teletext channel containing text in broadcasting, the script of a movie, scientific documents related to a presentation given, or the documents which are the subject of discussion at a business meeting.

Now let us make the result of acquisition of multimedia data more precise. For each modality the digital result is a temporal sequence of fundamental units, which in themselves do not have a temporal dimension. The nature of these units is the main factor discriminating the different modalities. The visual modality of a video document is a set of ordered images, or frames. So the fundamental units are the single image frames. Similarly, the auditory modality is a set of samples taken within a certain time span, resulting in audio samples as fundamental units. Individual characters form the fundamental units for the textual modality.

As multimedia data streams are very large, data compression is usually applied, except when the data can be processed directly and there is no need for storing the video. For video the most common compression standards are MPEG-1, MPEG-2 and MPEG-4. For audio mp3 is the best known compression standard.

Finally, apart from the multimedia data, there is a lot of factual information related to the video sources, e.g. the date of a meeting, the stock market value of the company discussed in a video, or viewing statistics for a broadcast.


2.4 Ontology creation

As indicated above, many information sources are used for a video. Furthermore, videos rarely occur in isolation. Hence, there is always a context in which the video has to be considered. For indexing and analysis purposes it is therefore important to make use of external knowledge sources. Examples are plenty: for news broadcasts, news sites on the internet provide info on current issues and the CIA factbook contains information on current countries and presidents; indexing film is greatly helped by consulting the Internet Movie Database; and so on. Furthermore, within a company or institute local knowledge sources might be present that can help in interpreting the data. For security applications a number of images of suspect people might be available.

All of the above information and knowledge sources have their own format and style of use. It is therefore important to structure the different knowledge sources into ontologies and make the information elements instantiations of concepts in the ontology. In this manner the meaning of a concept is uniquely defined, and can hence be used in a consistent way.

2.5 Produced video documents


As a baseline for video we consider how video documents are created in a production environment. In chapter 6 we will then consider the indexing of recorded videos in such an environment, as in this manner all different aspects of a video will be covered.

An author uses visual, auditory, and textual channels to express his or her ideas. Hence, the content of a video is intrinsically multimodal. Let us make this more precise. In [94] multimodality is viewed from the system domain and is defined as “the capacity of a system to communicate with a user along different types of communication channels and to extract and convey meaning automatically”. We extend this definition from the system domain to the video domain, using an author’s perspective, as:

Definition 3 (Multimodality) The capacity of an author of the video document to express a predefined semantic idea, by combining a layout with a specific content, using at least two information channels.

We consider the following three information channels, or modalities, within a video document:

• Visual modality: contains the mise-en-scene, i.e. everything, either naturally or artificially created, that can be seen in the video document;

• Auditory modality: contains the speech, music and environmental sounds that can be heard in the video document;

• Textual modality: contains textual resources that describe the content of the video document;

For each of those modalities, definition 3 naturally leads to a semantic perspective, i.e. which ideas did the author have, a content perspective, indicating which content is used by the author and how it is recorded, and a layout perspective, indicating how the author has organized the content to optimally convey the message. We will now discuss each of the three perspectives involved.

1This section is adapted from [131].


2.5.1 Semantic index

The first perspective expresses the intended semantic meaning of the author. Defined segments can have a different granularity, where granularity is defined as the descriptive coarseness of a meaningful unit of multimodal information [30]. To model this granularity, we define segments on five different levels within a semantic index hierarchy. The first three levels are related to the video document as a whole. The top level is based on the observation that an author creates a video with a certain purpose. We define:

• Purpose: set of video documents sharing similar intention;

The next two levels define segments based on consistent appearance of layout or content elements. We define:

• Genre: set of video documents sharing similar style;

• Sub-genre: a subset of a genre where the video documents share similar content;

The next level of our semantic index hierarchy is related to parts of the content, and is defined as:

• Logical units: a continuous part of a video document’s content consisting of a set of named events or other logical units which together have a meaning;

where a named event is defined as:

• Named events: short segments which can be assigned a meaning that doesn’t change in time;

Note that named events must have a non-zero temporal duration. A single image extracted from the video can have meaning, but this meaning will never be perceived by the viewer when it is not consistent over a set of images.

At the first level of the semantic index hierarchy we use purpose. As we only consider video documents that are made within a production environment, the purpose of data analysis is excluded. Genre examples, forming the second level, range from feature films and news broadcasts to commercials. On the third level are the different sub-genres, which can be e.g. a horror movie or an ice hockey match. Examples of logical units, at the fourth level, are a dialogue in a drama movie, a first quarter in a basketball game, or a weather report in a news broadcast. Finally, at the lowest level, consisting of named events, examples range from explosions in action movies and goals in soccer games to a visualization of stock quotes in a financial news broadcast.

2.5.2 Content

The content perspective relates segments to elements that an author uses to create a video document. The following elements can be distinguished [12]:

• Setting: time and place in which the video’s story takes place; can also emphasize atmosphere or mood;

• Objects: noticeable static or dynamic entities in the video document;

• People: human beings appearing in the video document;

Typically, setting is related to logical units, as one logical unit is taken in the same location. Objects and people are the main elements in named events.


2.5.3 Recording the scene

The appearance of the different content elements can be influenced by the author of the video document by using modality-specific style elements. For the visual modality an author can apply different style elements. She can use specific colors and lighting which, combined with the natural lighting conditions, define the appearance of the scene. Camera angles and camera distance can be used to define the scale and observed pose of objects and people in the scene. Finally, camera movement in conjunction with the movement of objects and people in the scene defines the dynamic appearance. Auditory style elements are the loudness and the rhythmic and musical properties. The textual appearance is determined by the style of writing and the phraseology, i.e. the choice of words and the manner in which something is expressed in words. All these style elements contribute to expressing an author’s intention.

2.5.4 Layout

The layout perspective considers the syntactic structure an author uses for the video document.

Upon the fundamental units an aggregation is imposed, which is an artifact of creation. We refer to these aggregated fundamental units as sensor shots, defined as a continuous sequence of fundamental units resulting from an uninterrupted sensor recording. For the visual and auditory modality this leads to:

• Camera shots: result of an uninterrupted recording of a camera;

• Microphone shots: result of an uninterrupted recording of a microphone;

For text, sensor recordings do not exist. In writing, uninterrupted textual expressions can be found at different granularity levels, e.g. word level or sentence level; therefore we define:

• Text shots: an uninterrupted textual expression;

Note that sensor shots are not necessarily aligned. Speech for example can continue while the camera switches to show the reaction of one of the actors. There are however situations where camera and microphone shots are recorded simultaneously, for example in live news broadcasts.

An author of the video document is also responsible for concatenating the different sensor shots into a coherent structured document by using transition edits. “He or she aims to guide our thoughts and emotional responses from one shot to another, so that the interrelationships of separate shots are clear, and the transitions between sensor shots are smooth” [12]. For the visual modality abrupt cuts, or gradual transitions2, like wipes, fades or dissolves, can be selected. This is important for visual continuity, but sound is also a valuable transitional device in video documents, not only to relate shots, but also to make changes more fluid or natural. For the auditory transitions an author can have a smooth transition using music, or an abrupt change by using silence [12]. To indicate a transition in the textual modality, e.g. closed captions, an author typically uses “>>>”, or different colors. These can be viewed as corresponding to abrupt cuts, as their use is only to separate shots, not to connect them smoothly.

The final component of the layout are the optional visual or auditory special effects, used to enhance the impact of the modality, or to add meaning.

2 A gradual transition actually contains pieces of two camera shots; for simplicity we regard it as a separate entity.


Figure 2.2: A unifying framework for multimodal video indexing based on an author’s perspective. The letters S, O, and P stand for setting, objects, and people. An example layout of the auditory modality is highlighted; the same holds for the other modalities.

Overlaid text, which is text that is added to video frames at production time, is also considered a special effect. It provides the viewer of the document with descriptive information about the content. Moreover, the size and spatial position of the text in the video frame indicate its importance to the viewer. “Whereas visual effects add descriptive information or stretch the viewer’s imagination, audio effects add level of meaning and provide sensual and emotional stimuli that increase the range, depth, and intensity of our experience far beyond what can be achieved through visual means alone” [12]. Note that we don’t consider artificially created content elements as special effects, as these are meant to mimic true settings, objects, or people.

2.6 Discussion

Based on the discussion in the previous sections we come to a unifying multimodal video indexing framework based on the perspective of an author. This framework is visualized in figure 2.2. It forms the basis for the discussion of methods for video indexing in chapter 6.

Although the framework is meant for produced video, it is in fact a blueprint for other videos as well. Basically, all of the above categories of videos can be mapped to the framework. However, not in all cases will all modalities be used, and maybe there is no layout because no editing has taken place, and so on.

Keyterms in this chapter

ontologies, sensory data, digital multimedia data, data compression, purpose, genre, sub-genre, logical unit, named events, setting, objects, people, sensor shots, camera shots, microphone shots, text shots, transition edits, special effects.


Chapter 3

Data representation

3.1 Categorization of multimedia descriptions

To be able to retrieve multimedia objects with the same ease as we are used to when accessing text or factual data, as well as to filter out irrelevant items in the large streams of multimedia reaching us, requires appropriate descriptions of the multimedia data. Although it might seem that multimedia retrieval is a trivial extension of text retrieval, it is in fact far more difficult. Most of the data is of sensory origin (image, sound, video) and hence techniques from digital signal processing and computer vision are required to extract relevant descriptions. Such techniques in general yield features which do not relate directly to the user’s perception of the data, the so-called semantic gap. More precisely, it is defined as [128]:

The semantic gap is the lack of coincidence between the information that one can extract from the sensory1 data and the interpretation that the same data have for a user in a given situation.

Consequently, there are two levels of descriptions of multimedia content, one on either side of the semantic gap. In addition to the content description there are also descriptions which are mostly concerned with the carrier of the information. Examples are the pixel size of an image or the sampling rate of an audio fragment. Such a description can also record information like the owner of the data or the time the data was generated. This leads to the following three descriptive levels [69]:

• perceptual descriptions: descriptions that can be derived from the data

• conceptual descriptions: descriptions that cannot be derived directly from the data as an interpretation is necessary.

• non-visual/auditory descriptions: descriptions that cannot be derived from the data at all.

Standards for the exchange of multimedia information, like MPEG-7 [88][89][80], give explicit formats for descriptions attached to the multimedia object. However, these standards do not indicate how to find the values/descriptions to be used for a specific multimedia data object, and especially not how to do this automatically.

Clearly the semantic gap is exactly what separates the perceptual and conceptual level. In this chapter we will consider the representation of the different modalities at the perceptual level. In later chapters we will consider indexing techniques which derive descriptions at the conceptual level. If such an indexing technique can be developed for a specific conceptual term, this term basically moves to the perceptual category, as the system can then generate the description automatically, without human intervention.

1 In the reference this is restricted to visual data, but the definition is applicable to other sensory data like audio as well.

3.2 Basic notions

Multimedia data can be described at different levels. The first distinction to be made is between the complete data item, like an entire video, a song, or a text document, and subparts of the data, which can attain different forms. A second distinction is between the content of the multimedia data and the layout. The layout is closely related to the way the content is presented to the viewer, listener, or reader.

Now let us take a closer look at what kind of subparts one can consider. This is directly related to the methods available for deriving those parts, and it has a close resemblance to the different classes of descriptions considered.

Four different objects are considered:

• perceptual object: a subpart of the data which can be derived by weak segmentation, i.e. segmentation of the datafield based on perceptual characteristics of the data.

• conceptual object: a part of the data which can be related to conceptual descriptions and which cannot be derived from the data without an interpretation; the process of finding these objects is called strong segmentation.

• partition: the result of a partitioning of the datafield, which is not dependent on the data itself.

• layout object: the basic data elements which are structured and related by the layout.

Examples of the above categories for the different modalities will be considered in the subsequent sections. In the rest of the notes we will use:

• multimedia item: a full multimedia data item, a layout element, or an element in the partitioning of a full multimedia item.

• multimedia object: when this term is used it can refer to either a conceptual or a perceptual object.

Hence, a multimedia item can be for example a complete video, an image, or an audio CD, but also a shot in a video, or the upper left quadrant of an image. A multimedia object can e.g. be a paragraph in a text, or a house in an image.

3.3 Audio representation


2The info in this section is mostly based on [78].


Figure 3.1: An example of a signal in the time domain (amplitude versus time in ms, top) and its spectrum (magnitude in dB versus frequency in kHz, bottom).

3.3.1 Audio features

An audio signal is a digital signal in which the amplitude is given as a function of time. When analyzing audio one can consider the time-domain signal directly, but it can be advantageous to also consider the signal on the basis of the frequencies it contains. The Fourier transform is a well-known technique to compute the contribution of the different frequencies in the signal. The result is called the audio spectrum of the signal. An example of a signal and its spectrum is shown in figure 3.1.
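As an illustration (not part of the original notes), here is a minimal Python sketch that generates a short test tone and computes its spectrum with the fast Fourier transform; the sampling rate, frequency, and duration are arbitrary choices:

    import numpy as np

    # A 1 kHz test tone sampled at 8 kHz; 60 ms of signal, as in figure 3.1.
    fs = 8000                                  # sampling rate [Hz]
    t = np.arange(0, 0.06, 1.0 / fs)           # time axis [s]
    x = np.sin(2 * np.pi * 1000 * t)           # time-domain amplitude

    spectrum = np.fft.rfft(x)                  # complex spectrum
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)  # frequency axis [Hz]
    magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-12)  # magnitude in dB

    print(freqs[np.argmax(magnitude_db)])      # dominant frequency, approx. 1000 Hz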

We now consider audio features which can be derived directly from the signal.

• Average Energy: the average squared amplitude of the signal, an indication of the loudness.

• Zero Crossing Rate: a measure indicating how often the amplitude switches from positive to negative and vice versa.

• Rhythm: measures based on the pattern produced by emphasis and duration of notes in music or by stressed and unstressed syllables in spoken words.

• Linear Prediction Coefficients (LPC): a measure of how well a sample in the signal can be predicted based on previous samples.

In the frequency domain there is also an abundance of features to compute. Often encountered features are listed below; a small computational sketch of several of these features follows the list:

• Bandwidth: the frequency range of the sound, in its simplest form the difference between the largest and smallest non-zero frequency in the spectrum.

• Fundamental frequency: the dominant frequency in the spectrum.

• Brightness (or spectral centroid): the normalized sum of all frequencies weighted by how much each frequency is present in the signal.

• Mel-Frequency Cepstral Coefficients (MFCC): a set of coefficients designed such that they correspond to how humans hear sound.

• Pitch: degree of highness or lowness of a musical note or voice.

• Harmonicity: degree to which the signal is built out of multiples of the fundamental frequency.

• Timbre: characteristic quality of sound produced by a particular voice or instrument (subjective).
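To make the listed features concrete, the following Python sketch (an illustration under simplifying assumptions, not the notes’ reference implementation) computes average energy, zero crossing rate, brightness, a simple bandwidth estimate, and the fundamental frequency of a signal:

    import numpy as np

    def audio_features(x, fs):
        # Average energy: an indication of the loudness.
        energy = np.mean(x ** 2)
        # Zero crossing rate: fraction of samples where the sign flips.
        zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2
        # Frequency-domain features based on the magnitude spectrum.
        spectrum = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
        brightness = np.sum(freqs * spectrum) / np.sum(spectrum)   # spectral centroid
        nonzero = freqs[spectrum > 1e-6 * spectrum.max()]
        bandwidth = nonzero.max() - nonzero.min()                  # crude bandwidth
        fundamental = freqs[np.argmax(spectrum)]                   # dominant frequency
        return dict(energy=energy, zcr=zcr, brightness=brightness,
                    bandwidth=bandwidth, fundamental=fundamental)

    fs = 8000
    t = np.arange(0, 0.5, 1.0 / fs)
    x = 0.7 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
    print(audio_features(x, fs))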

3.3.2 Audio segmentation

An audio signal can be long, and over time the characteristics of the audio signal, and hence its features, will change. Therefore, in practice one partitions the signal into small segments of, say, a duration of 10 milliseconds. For each of those segments any of the features mentioned can then be calculated.

An audio signal can contain music, speech, silence, and other sounds (like cars etc.). Weak segmentation of audio aims at decomposing the signal into these four components. An important step is decomposing the signal based on different ranges of frequency. A second important factor is harmonicity, as this distinguishes music from other audio.

Strong segmentation of audio aims at detecting different conceptual objects like cars, or individual instruments in a musical piece. This requires building models for each of the different concepts.

3.3.3 Temporal relations

When we have two different audio segments A and B, there is a select set of temporal relations that can hold between A and B, namely precedes, meets, overlaps, starts, equals, finishes, during, and their inverses, denoted by adding i at the end of the name (if B precedes A, the relation precedes i holds between B and A). These are known as Allen’s relations [5]. As equals is symmetric, there are 13 such relations in total. The relations are such that for any two intervals A and B one and exactly one of the relations holds.
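As a sketch of how these relations can be computed (an illustration, not part of the original notes), the following Python function decides which of Allen’s 13 relations holds for two segments given as (start, end) pairs; the suffix _i marks the inverse relations:

    def allen_relation(a, b):
        # Returns which of Allen's 13 temporal relations holds between
        # intervals a = (a_start, a_end) and b = (b_start, b_end).
        a1, a2 = a
        b1, b2 = b
        if a2 < b1:
            return "precedes"
        if a2 == b1:
            return "meets"
        if a1 < b1 < a2 < b2:
            return "overlaps"
        if a1 == b1 and a2 < b2:
            return "starts"
        if b1 < a1 and a2 < b2:
            return "during"
        if b1 < a1 and a2 == b2:
            return "finishes"
        if a1 == b1 and a2 == b2:
            return "equals"
        # Otherwise one of the inverse relations holds: test the swapped pair.
        return allen_relation(b, a) + "_i"

    # A speech segment from 0-5 s and a music segment from 3-8 s overlap.
    print(allen_relation((0.0, 5.0), (3.0, 8.0)))  # -> overlaps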

3.4 Image representation

3.4.1 Image features

A digital image is an array of pixels, where each pixel has a color. The basic representation for the color of a pixel is the triple R(ed), G(reen), B(lue). There are however many other color spaces which are more appropriate for certain tasks. We consider HSV and Lab here.

A first thing to realize is that the color of an object is actually a color spectrum, indicating how much of a certain wavelength is present (white light contains an equal amount of all wavelengths). This is the basis for defining HSV. To be precise, the three components of HSV are as follows. H(ue) is the dominant wavelength in the color spectrum; it is what you typically mean when you say the object is red, yellow, blue, green, purple, etc. S(aturation) is a measure for the amount of white in the spectrum; it defines the purity of a color, distinguishing for example signal-red from pink. Finally, the V(alue) is a measure for the brightness or intensity of the color; it makes the difference between a dark and a light color if they have the same H and S values.

Lab is another color space that is used often. The L is similar to the V in HSV. The a and b are similar to H and S. The important difference is that in the Lab space the distance between colors in the color space is approximately equal to the perceived difference between the colors. This is important in defining similarity; see chapter 4.
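As a small illustration using Python’s standard library (the example colors are made up), converting RGB triples to HSV shows how saturation separates a saturated red from pink, while both share the same hue:

    import colorsys

    # RGB values are assumed to be normalized to [0, 1].
    colors = {"signal red": (0.9, 0.1, 0.1), "pink": (1.0, 0.7, 0.7)}

    for name, (r, g, b) in colors.items():
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        print(f"{name}: hue={h:.2f} saturation={s:.2f} value={v:.2f}")
    # Both colors have the same (red) hue; the lower saturation of pink
    # reflects the larger amount of white in its spectrum.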

In the above, the color is assigned to individual pixels. All these colors will generate patterns in the image, which can be small or large. These patterns are denoted with the general term texture. Texture is of great importance in classifying different materials like the line-like pattern in a brick wall, or the dot-like pattern of sand.

In an image there will in general be different colors and/or textures. This means there will be many positions in the image where there is a significant change in image data, in particular a change in color or texture. These changes form (partial) lines called edges.

The above measures give a local description of the data in the image. In many cases global measures, summarizing the information in the image, are used. Most commonly these descriptions are in the form of color histograms, counting how many pixels have a certain color. It can, however, also be a histogram of the directions of the different edge pixels in the image.
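A minimal sketch of such a global color histogram (the choice of 4 bins per channel and the random stand-in image are arbitrary):

    import numpy as np

    # A random image standing in for real pixel data, shape (height, width, RGB).
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)

    # Quantize each channel into 4 bins and encode each pixel as one of 4^3 = 64 colors.
    bins = 4
    q = (image // (256 // bins)).astype(int)
    codes = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]

    # Count pixels per quantized color and normalize, so that images of
    # different sizes can be compared.
    histogram = np.bincount(codes.ravel(), minlength=bins ** 3)
    histogram = histogram / histogram.sum()
    print(histogram.shape, histogram.sum())  # (64,) 1.0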

An image histogram loses all information on the spatial configuration of the pixels. If there is a peak in the histogram at the color red, the pixels can be scattered all around the image, or they can form one big red region. Color coherence vectors are an alternative representation which considers how many pixels in the neighborhood have the same color. A similar representation can be used for edge direction.

The above histograms and coherence vectors can be considered as summaries of the data; they are non-reversible. One cannot recover the original image from the histogram. The following two descriptions do have that property.

The Discrete Cosine Transform (DCT) is a transform which takes an image and computes its frequency-domain description. This is exactly the same as considered for audio earlier, but now in two dimensions. Coefficients of the low-frequency components give a measure of the amount of large-scale structure, whereas the high-frequency coefficients give information on local detail. The (Haar) wavelet transform is a similar representation, which also takes into account where in the image the specific structure is found.
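The following sketch (an illustration, not taken from the notes) builds an orthonormal DCT-II basis with NumPy and applies it to an 8x8 image block; it also demonstrates the reversibility mentioned above, since the block can be reconstructed exactly from its coefficients:

    import numpy as np

    def dct_matrix(n):
        # Orthonormal DCT-II basis matrix of size n x n.
        k = np.arange(n)[:, None]
        m = np.arange(n)[None, :]
        c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
        c[0, :] /= np.sqrt(2.0)
        return c

    rng = np.random.default_rng(1)
    block = rng.random((8, 8))              # an 8x8 grayscale block
    c = dct_matrix(8)
    coefficients = c @ block @ c.T          # 2D DCT: low indices = coarse structure
    reconstructed = c.T @ coefficients @ c  # inverse transform
    print(np.allclose(block, reconstructed))  # True: the description is reversible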

3.4.2 Image segmentation

For images we can also consider the different ways of segmenting an image. A partition decomposes the image into fixed regions. Commonly this is either a fixed set of rectangles, or one fixed rectangle in the middle of the image and a further partition of the remaining space into a fixed number of equal parts.

Weak segmentation boils down to grouping pixels in the image based on a homogeneity criterion on color or texture, or by connecting edges. It leads to a decomposition of the image where each region in the decomposition has a uniform color or texture.

For strong segmentation, finding specific conceptual objects in the image, we again have to rely on models for each specific object, or on a large set of hand-annotated examples.


3.4.3 Spatial relations

For two rectangular regions we can consider Allen’s relations separately for the x- and y-coordinates. However, in general regions have an arbitrary shape. There are various 2D spatial relations one can consider. Relations like left-of, above, surrounded-by, and nearest neighbor are an indication of the relative positions of regions in the image. Constraints like inside and enclosed-by are indications of topological relations between regions.

3.5 Video representation

3.5.1 Video features

As a video is a set of temporally ordered images, its representation clearly shares many of the representations considered above for images. However, the addition of a time component also adds many new aspects.

In particular we can consider the observed movement of each pixel from one frame to another, called the optic flow, or we can consider the motion of individual objects segmented from the video data.

3.5.2 Video segmentation

A partition of a video can be any combination of a temporal and a spatial partition. For weak segmentation we have, in addition to color and texture based grouping of pixels, motion based grouping, which groups pixels if they have the same optic flow, i.e. move in the same direction with the same speed. Strong segmentation again requires object models. The result of either weak or strong segmentation is called a spatio-temporal object.

For video there is one special case of weak segmentation, which is temporal segmentation. Here, the points in time are detected where there is a significant change in the content of the frames. This will be considered in a more elaborate form later.

3.5.3 Spatio-temporal relations

Spatio-temporal relations are clearly a combination of spatial and temporal relations. One should note, however, that in a video spatial relations between two objects can vary over time. Two objects A and B can be in the relation A left-of B at some point in time, while the movement of A and B can yield the relation B left-of A at a later point in time.

3.6 Text representation

3.6.1 Text features

The basic representation for text is the so called bag-of-words approach. In this approach a kind of histogram is made indicating how often a certain word is present in the text. This histogram construction is preceded by a stop word elimination step in which words like "the", "in", etc. are removed. One also performs stemming on the words, bringing each word back to its base form; e.g. "biking" is reduced to the stem of the verb "to bike".
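A minimal sketch of the bag-of-words construction with stop word elimination and a deliberately crude stemmer (the stop list and suffix rules are toy assumptions; real systems use e.g. a Porter stemmer):

    from collections import Counter

    STOP_WORDS = {"the", "in", "a", "of", "and", "is"}   # tiny illustrative stop list

    def crude_stem(word):
        # very crude suffix stripping, for illustration only
        for suffix in ("ing", "er", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def bag_of_words(text):
        words = [w.lower().strip(".,!?") for w in text.split()]
        words = [crude_stem(w) for w in words if w not in STOP_WORDS]
        return Counter(words)

    print(bag_of_words("The biker is biking in the park"))
    # Counter({'bik': 2, 'park': 1}) -- 'biker' and 'biking' map to the same stem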

The bag-of-words model commonly used is the Vector Space Model [116]. The model is based on linear algebra. A document is modeled as a vector of words, where each word is a dimension in Euclidean space. Let T = {t_1, t_2, ..., t_n} denote the set of terms in the collection.


           D_1   D_2   D_3   D_4   ...   D_m
    T_1     1     3     4     0
    T_2     2     0     0     0
    T_3     0     1     0     0
    T_4     0     0     2     1
    ...
    T_n

Figure 3.2: A Term Document matrix.

Then we can represent the terms d_j^T in document d_j as a vector ~x = (x_1, x_2, ..., x_n) with:

    x_i = \begin{cases} t_{ij} & \text{if } t_i \in d_j^T \\ 0 & \text{if } t_i \notin d_j^T \end{cases}    (3.1)

where t_ij represents the frequency of term t_i in document d_j. Combining all document vectors creates a term-document matrix. An example of such a matrix is shown in figure 3.2.

Depending on the context, a word carries a certain amount of information. In an archive about information retrieval, the word 'retrieval' does not add much information about a document, as the word 'retrieval' is very likely to appear in many documents in the collection. The underlying rationale is that words that occur frequently in the complete collection have low information content. However, if a single document contains many occurrences of a word, the document is probably relevant. Therefore, a term can be given a weight, depending on its information content. A weighting scheme has two components: a global weight and a local weight. The global weight of a term indicates its overall importance in the entire collection, weighting all occurrences of the term with the same value. Local term weighting measures the importance of the term in a document. Thus, the value for a term i in a document j is L(i, j) * G(i), where L(i, j) is the local weighting for term i in document j and G(i) is the global weight. Several different term weighting schemes have been developed. We consider the following simple form here, known as inverse term frequency weighting. The weight w_ij of a term t_i in a document j is given by

    w_{ij} = t_{ij} \cdot \log(N / t_{i*})    (3.2)

where N is the number of documents in the collection and t_{i*} denotes the total number of times word t_i occurs in the collection. The logarithm dampens the effect of very high term frequencies.
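A small worked example of this weighting scheme (the toy term counts are made up; note that with this formula a term occurring throughout the collection receives a weight of zero or even a negative weight):

    import math

    # term_counts[d][t] = number of times term t occurs in document d (toy data)
    term_counts = [
        {"retrieval": 1, "soccer": 2},
        {"retrieval": 1, "index": 4},
        {"retrieval": 1, "index": 1},
    ]
    N = len(term_counts)

    # total number of occurrences of each term in the whole collection (t_i*)
    collection_freq = {}
    for doc in term_counts:
        for term, tf in doc.items():
            collection_freq[term] = collection_freq.get(term, 0) + tf

    def weight(term, doc):
        # w_ij = t_ij * log(N / t_i*), as in equation (3.2)
        return doc.get(term, 0) * math.log(N / collection_freq[term])

    print(weight("soccer", term_counts[0]))     # rare in the collection: positive weight
    print(weight("retrieval", term_counts[0]))  # occurs in every document: weight 0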

Going one step further, one can also consider the co-occurrence of certain words, in particular which words follow each other in the text. If one derives these statistics from a large collection of documents and uses them in analyzing other documents, this is called a bi-gram language model. It gives the probability that a certain word is followed by another word. It is therefore also an instantiation of a Markov model. When three or, more generally, n subsequent words are used we have a 3-gram or n-gram language model.
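A minimal sketch of estimating a bi-gram language model from a toy collection by counting which word follows which (no smoothing is applied):

    from collections import Counter, defaultdict

    corpus = ["the game was played in the stadium",
              "the game was cancelled"]          # toy training collection

    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            bigram_counts[w1][w2] += 1

    def p_next(w1, w2):
        # P(w2 | w1) estimated by relative frequency
        total = sum(bigram_counts[w1].values())
        return bigram_counts[w1][w2] / total if total else 0.0

    print(p_next("game", "was"))    # 1.0: 'game' is always followed by 'was' here
    print(p_next("was", "played"))  # 0.5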


3.6.2 Text segmentation

Different parts of a document may deal with different topics and therefore it can be advantageous to partition a document into parts of fixed size for which the word histogram is computed.

A useful technique, which can be considered an equivalent of weak segmentation, is part-of-speech tagging [79]. In this process each word is assigned the proper class, e.g. verb, noun, etc. A simple way of doing this is by making use of a dictionary and a bi-gram language model. One can also take the tagged result and find larger chunks as aggregates of the individual words. This technique is known as chunking [1].

A more sophisticated, but less general approach, is to generate a grammar for the text and use a parser to do the part-of-speech tagging. It is, however, very difficult to create grammars which can parse an arbitrary text.

3.6.3 Textual relations

As text can be considered as a sequence of words, the order in which words are placed in the text directly yields a relation "precedes". This is similar to Allen's relations for time, but as words cannot overlap, there is no need for the other relations.

In the case where the layout of the text is also taken into account, i.e. how it is printed on paper, many other relations appear, as we can then consider the spatial relations between blocks in the document.

3.7 Other data

As indicated earlier, next to the multimodal stream of text, audio, and video there is also other related information which is not directly part of the video document. This is often factual data. For these we use the common categorization into [156]:

• nominal data: these values are taken from a selected set of symbols, where no relation is supposed between the different symbols.

• ordinal data: these values are taken from a selected set of symbols, where an ordering relation is supposed between the different symbols.

• interval data: quantities measured in fixed and equal units.

• ratio data: quantities for which the measurement scheme inherently defines a zero point.

Keyterms in this chapter

Perceptual metadata, conceptual metadata, non-visual/auditory metadata, multimedia item, multimedia object, partitioning, perceptual object, weak segmentation, conceptual object, strong segmentation, audio spectrum, Allen's relations, color space, texture, edges, spatial relations, topological relations, optic flow, spatio-temporal object, n-gram, part-of-speech tagging, chunking


Chapter 4

Similarity

When considering collections of multimedia items, it is not sufficient to describe individual multimedia items. It is equally important to be able to compare different multimedia items. To that end we have to define a so called (dis)similarity function S to compare two multimedia items I1 and I2. The function S measures to what extent I1 and I2 look or sound similar, to what extent they share a common style, or to what extent they have the same interpretation. Thus, in general we distinguish three different levels of comparison:

• Perceptual similarity

• Layout similarity

• Semantic similarity

Each of these levels will now be described.

4.1 Perceptual similarity

Computing the dissimilarity of two multimedia items based on their data is mostly done by comparing their feature vectors. For an image this can for example be an HSV color histogram, but it can also be an HSV histogram followed by a set of values describing the texture in the image.

One dissimilarity function between feature vectors is the Euclidean distance. As not all elements in the vector may be equally important, the distances between the individual elements can be weighted. To make the above more precise, let F^1 = {f^1_i}_{i=1,...,n} and F^2 = {f^2_i}_{i=1,...,n} be the two vectors describing multimedia items I1 and I2 respectively. Then the dissimilarity S_E according to the weighted Euclidean distance is given by:

    S_E(F^1, F^2) = \sqrt{\sum_{i=1}^{n} w_i (f^2_i - f^1_i)^2}

with ~w = {w_i}_{i=1,...,n} a weighting vector.

For color histograms (or any other histogram for that matter) the histogram intersection is also often used to denote dissimilarity. It does not measure Euclidean distance but takes the minimum of the two entries. Using the same notation as above we find:

    S_\cap(F^1, F^2) = \sum_{i=1}^{n} \min(f^2_i, f^1_i)
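A minimal sketch of both measures for toy histograms, assuming NumPy:

    import numpy as np

    def weighted_euclidean(f1, f2, w):
        # S_E(F1, F2) = sqrt( sum_i w_i * (f2_i - f1_i)^2 )
        return np.sqrt(np.sum(w * (f2 - f1) ** 2))

    def histogram_intersection(f1, f2):
        # S_cap(F1, F2) = sum_i min(f2_i, f1_i); higher means more similar
        return np.sum(np.minimum(f1, f2))

    h1 = np.array([0.2, 0.5, 0.3])          # toy normalized histograms
    h2 = np.array([0.1, 0.6, 0.3])
    weights = np.ones_like(h1)              # equal importance for every bin
    print(weighted_euclidean(h1, h2, weights))
    print(histogram_intersection(h1, h2))   # 0.9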


In [133] it is shown that for computing the similarity it is advantageous to compute the cumulative histogram \hat{F} rather than the regular histogram F. This is due to the fact that the use of cumulative histograms reduces the effect of having to limit the number of different bins in a histogram. The cumulative histogram entries are computed as:

    \hat{F}(m) = \sum_{k=0}^{m} F_k

where F is the regular histogram. After computing the cumulative histograms, histogram intersection can be used to compute the similarity between the two histograms.

For comparing two texts it is common to compute the similarity rather than the dissimilarity. Given the vector-space representation, the similarity between two text vectors ~q and ~d, both containing n elements, is defined as:

    S = \vec{q} \cdot \vec{d}    (4.1)

where \cdot denotes the inner product, computed as

    \vec{q} \cdot \vec{d} = \sum_{i=1}^{n} q_i d_i

If one considers the vector as a histogram indicating how many times a certain word occurs in the document (possibly weighted by the importance of the term), this boils down to histogram multiplication. Hence, when a term occurs often in both texts, the contribution of that term to the value of the inner product will be high.

A problem with the above formulation is the fact that larger documents will contain more terms and hence are on average more similar than short pieces of text. Therefore, in practice the vectors are normalized by their length

    \|\vec{q}\| = \sqrt{\sum_{i=1}^{n} q_i^2}

leading to the so called cosine similarity:

    S = \frac{\vec{q} \cdot \vec{d}}{\|\vec{q}\| \, \|\vec{d}\|}

This measure is called the cosine similarity as it can be shown that it equals the cosine of the angle between the two length-normalized vectors.
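A minimal sketch of the cosine similarity for two toy term vectors, assuming NumPy; scaling a vector does not change the outcome:

    import numpy as np

    def cosine_similarity(q, d):
        # S = (q . d) / (||q|| * ||d||)
        return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

    q = np.array([1, 3, 0, 2], dtype=float)   # toy term-frequency vectors
    d = np.array([2, 6, 0, 4], dtype=float)   # same direction, twice as long
    print(cosine_similarity(q, d))            # 1.0: document length does not matter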

Feature vectors and similarity will play a crucial role in interactive multimedia retrieval, which will be considered in chapter 9.

4.2 Layout similarity

To measure the (dis)similarity of two multimedia items based on their layout, a common approach is to transform the layout of the item to a string containing the essential layout structure. For simplicity we consider the layout of a video, as this is easily transformed to a 1D string; the approach can however be extended to two dimensions, e.g. when comparing the layout of two printed documents. When the layouts of two multimedia items are described using strings they can be compared by making use of the edit distance [2], defined as follows:


Figure 4.1: Example of the semantic distance of two instantiations c1, c2 of concepts C1 and C2 organized in a tree structure. In this case the distance would be equal to 3.

Definition 4 (Edit Distance) Given two strings A = a1, a2, a3, ..., an and B = b1, b2, b3, ..., bm over some alphabet Σ, a set of allowed edit operations E, and a unit cost for each operation, the edit distance of A and B is the minimum cost to transform A into B by making use of the edit operations.

As an example let's take two video sequences and consider the transitions cut (c), wipe (w), and dissolve (d), and let a shot be denoted by the symbol s. Hence, Σ = {c, w, d, s}. Two example sequences could now look like

A = scswsdscs

B = sdswscscscs

Now let E = {insertion, deletion, substitution} and for simplicity let each operation have equal cost. To transform A into B we twice have to do a substitution, to exchange the effects c and d used, and two insert operations to add the final "cs". Hence, in this case the edit distance would be equal to 4.
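A sketch of the standard dynamic programming solution for the edit distance with unit costs, which reproduces the value 4 for the example above:

    def edit_distance(a, b):
        # classic dynamic programming with unit costs for insertion,
        # deletion, and substitution
        n, m = len(a), len(b)
        dist = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dist[i][0] = i
        for j in range(m + 1):
            dist[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + cost) # substitution
        return dist[n][m]

    print(edit_distance("scswsdscs", "sdswscscscs"))   # 4, as in the example above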

4.3 Semantic similarity

When multimedia items are described using concepts which are derived from an ontology, either annotated by the user or derived using pattern recognition, we can compare two multimedia items based on their semantics.

When the ontology is organized as a hierarchical tree, an appropriate measure for the semantic distance, i.e. the dissimilarity of two instantiations c1, c2 of the concepts C1 and C2, is the number of steps in the tree one has to follow to move from concept C1 to C2. If the ontology is organized as a set of hierarchical views, this leads to a vector where each element in the vector is related to one of the views. A very simple example is shown in figure 4.1.
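A minimal sketch of such a tree-based semantic distance, counting the edges on the path between two concepts via their lowest common ancestor (the toy ontology is made up):

    # toy ontology as child -> parent links
    parent = {
        "cat": "mammal", "dog": "mammal",
        "mammal": "animal", "bird": "animal",
    }

    def path_to_root(concept):
        path = [concept]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    def semantic_distance(c1, c2):
        p1, p2 = path_to_root(c1), path_to_root(c2)
        common = set(p1) & set(p2)
        # steps from c1 up to the lowest common ancestor, plus steps down to c2
        up = next(i for i, c in enumerate(p1) if c in common)
        down = next(i for i, c in enumerate(p2) if c in common)
        return up + down

    print(semantic_distance("cat", "dog"))    # 2: cat -> mammal -> dog
    print(semantic_distance("cat", "bird"))   # 3: cat -> mammal -> animal -> bird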

Keyterms in this chapter

(dis)similarity function, histogram intersection, Euclidean distance, histogram cross-correlation, cosine distance, edit distance, semantic distance


Chapter 5

Multimedia Indexing Tools

5.1 Introduction

In the previous chapters we have made a first step in limiting the size of the semantic gap, by representing multimedia data in terms of various features and similarities. In particular, we have considered features which are of the perceptual class. In this chapter we will consider general techniques that use these perceptual features to find the conceptual label of a multimedia object.

The techniques required to do the above task are commonly known as pattern recognition, where pattern is the generic term used for any set of data elements or features. The descriptions in this chapter are taken for the largest part from the excellent review of pattern recognition in [57].

5.2 Pattern recognition methods

Many methods for pattern recognition exist. Most of the methods fall into one of the four following categories:

• Template matching: the pattern to be recognized is compared with a learned template, allowing changes in scale and pose;

This simple and intuitive method can work directly on the data. For images a template is usually a small image (let's say 20x20 pixels); for audio it is a set of samples. Given a set of templates in the same class, one template representing the class is computed, e.g. by pixelwise averaging. In its simplest form any new pattern is compared pixelwise (for images) or samplewise (for audio) to the set of stored templates. The new pattern is then assigned to the class for which the correlation with the stored template is highest.

In practice template matching becomes more difficult as one cannot assume that two templates to be compared are near exact copies of one another. An image might have a different scale, the object in the image might have a different pose, or the audio template might have a different loudness. Hence, substantial preprocessing is required before template matching can take place. Invariant features can help in this problem (see section 5.5).

• Statistical classification: the pattern to be recognized is classified based on the distribution of patterns in the space spanned by pattern features;

• Syntactic or structural matching: the pattern to be recognized is compared to a small set of learned primitives and grammatical rules for combining primitives;


Figure 5.1: Overview of the multimedia indexing steps.


For applications where the patterns have an apparent structure these methods are appealing. They allow the introduction of knowledge on how the patterns in the different classes are built from the basic primitives.

As an example consider graphical symbols. Every symbol is built from lines, curves, and corners, which are combined into more complex shapes like squares and polygons. Symbols can be distinguished by looking at how the primitives are combined to create the symbol.

A major disadvantage of such methods is that they require that all primitives in the pattern are detected correctly, which is not the case if the data is corrupted by noise.

• Neural networks: the pattern to be recognized is input to a network which has learned nonlinear input-output relationships;

Neural networks mimic the way humans recognize patterns. They can be viewed as massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. In the human brain those simple processors are called neurons; when simulated on a computer one calls them perceptrons. In both cases, the processors have a set of weighted inputs and "fire" if the weighted sum of the inputs is above a certain threshold.

To train a neural network, input patterns are fed to the system and the expected output is defined. Then the weights for the different connections are automatically learned by the system. In most systems there are also many perceptrons which are connected to neither the input nor the output, but are part of the so-called hidden layer. Such a neural network is called a multi-layer perceptron.

For specific cases neural networks and statistical classifiers coincide. Neural networks are appealing as they can be applied in many different situations. It is, however, difficult to understand why a neural network assigns a certain class to an input pattern.

Examples of the above methods are found throughout the lecture notes. The statistical methods are the most commonly encountered ones; they will be considered in more detail next.

5.3 Statistical methods

We consider methods here which assume that the patterns to be recognized are described as a feature vector; thus every pattern can be associated with a specific point in the feature space. In addition a distance function should be provided, which in our case would be a similarity function as defined in chapter 4. As an example consider the following very simple classification problem: a set of images of characters (i, o, a) for which only two features are calculated, namely the width and the height of the bounding box of the character, and similarity is defined as the Euclidean distance.

Before we start describing different methods, let us first consider the general scheme for building classifiers. The whole process is divided into two stages:

• Training: in this stage the system builds a model of the data and yields a classifier, based on a set of training patterns for which class information is provided by a "supervisor".

In the above a supervisor is defined. In the literature those methods are therefore called supervised classification methods. The process is as follows. Every pattern is preprocessed to remove noise etc. Then a set of features is calculated, and a classifier is learned through a statistical analysis of the dataset.


                    classified as i   classified as o
    true class i           71                 4
    true class o            2                38

Figure 5.2: Example confusion table.

To improve the results, the system can adjust the preprocessing, select the best features, or try other features.

• Testing: patterns from a test set are given to the system and the classifier outputs the optimal label.

In this process the system should employ exactly the same preprocessing steps and extract the same features as in the learning phase.

To evaluate the performance of the system the output of the classifier for the test patterns is compared to the label the supervisor has given to the test patterns. This leads to a confusion matrix, where one can see how often patterns are confused. In the above example it would for instance indicate how many "i's" were classified as being part of the class "o". An example confusion table is shown in figure 5.2.

After testing, the performance of the system is known and the classifier can be used for classifying unknown patterns into their class.

To make things more precise, let the c categories be given by ω1, ω2, ..., ωc. Furthermore, let the vector consisting of n features be given as ~x = (x1, x2, ..., xn).

A classifier is a system that takes a feature vector ~x and assigns to it the optimal class ωi. The confusion matrix is the matrix C where the element cij denotes the number of elements which have true class ωi, but are classified as being in class ωj.

Conceptually the simplest classification scheme is

• k-Nearest Neighbor: assigns a pattern to the majority class among the k patterns with smallest distance in feature space;

For this method, after selection of the relevant features a distance measure d has to be defined. In principle, this is the similarity function described in chapter 4. Very often it is simply the Euclidean distance in feature space. Having defined d, classification boils down to finding the nearest neighbor(s) in feature space. In 1-nearest neighbor classification the pattern ~x is assigned to the same class as its nearest neighbor. In k-nearest neighbors one uses majority voting on the class labels of the k points nearest in feature space. An example is shown in figure 5.3.

The major disadvantage of using k-nearest neighbors is that it is computationally expensive to find the k nearest neighbors if the number of data objects is large.
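A minimal sketch of k-nearest neighbor classification with Euclidean distance on the character bounding box example (the toy training data are made up):

    import numpy as np
    from collections import Counter

    def knn_classify(x, train_X, train_y, k=3):
        # Euclidean distance from x to every training pattern
        distances = np.linalg.norm(train_X - x, axis=1)
        nearest = np.argsort(distances)[:k]
        # majority vote over the class labels of the k nearest neighbors
        return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

    # toy training set: (width, height) of character bounding boxes
    train_X = np.array([[2, 8], [3, 9], [6, 6], [7, 7], [6, 7]], dtype=float)
    train_y = ["i", "i", "o", "o", "o"]

    print(knn_classify(np.array([5.5, 6.5]), train_X, train_y, k=3))   # 'o'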

In statistical pattern recognition a probabilistic model for each class is derived. Hence, one only has to compare a new pattern with one model for each class. The crux of statistical pattern recognition is then to model the distribution of feature values for the different classes, i.e. giving a way to compute the conditional probability P(~x|ωi).

• Bayes Classifier: assigns a pattern to the class which has the maximum estimated posterior probability;


Figure 5.3: Illustration of classification based on k-nearest neighbors.

Figure 5.4: Example of a 2D linear separator (a classifier with minimum error in a two-dimensional feature space).

Thus, assign input pattern ~x to class ωi if

    P(ω_i | ~x) > P(ω_j | ~x)   for all j ≠ i    (5.1)

When the feature values can be expected to be normally distributed (i.e. to follow a Gaussian distribution) the above can be used to find optimal linear boundaries in the feature space. An example is shown in figure 5.4.

Note that in general we do not know the parameters of the distribution (the mean µ and the standard deviation σ in the case of the normal distribution). These have to be estimated from the data. Most commonly, the mean and standard deviation of the training samples in each class are used.
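A sketch of a Bayes classifier under the assumption of independent, normally distributed features per class, with means and standard deviations estimated from the training samples (a naive Bayes variant; the data are made up):

    import numpy as np

    class GaussianBayes:
        def fit(self, X, y):
            self.classes = sorted(set(y))
            self.stats = {}
            for c in self.classes:
                Xc = X[np.array(y) == c]
                # per-class mean, standard deviation, and prior probability
                self.stats[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-6, len(Xc) / len(X))
            return self

        def predict(self, x):
            def log_posterior(c):
                mean, std, prior = self.stats[c]
                log_likelihood = -0.5 * np.sum(((x - mean) / std) ** 2
                                               + np.log(2 * np.pi * std ** 2))
                return np.log(prior) + log_likelihood
            # assign x to the class with maximum posterior probability, cf. (5.1)
            return max(self.classes, key=log_posterior)

    X = np.array([[2, 8], [3, 9], [6, 6], [7, 7]], dtype=float)
    y = ["i", "i", "o", "o"]
    print(GaussianBayes().fit(X, y).predict(np.array([6.5, 6.5])))   # 'o'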

If only a limited set of samples per class is available a Parzen estimator can be used. In this method a normal distribution with fixed standard deviation is placed at every sample. The probability that a new sample belongs to the class is computed as a weighted linear combination of all these distributions, where the contribution of each training sample decreases with its distance to the new sample.

In the Bayes classifier all features are considered at once, which is more accurate, but also more complicated than using the features one by one. The latter leads to the

• Decision Tree: assigns a pattern to a class based on a hierarchical division of feature space;

To learn a decision tree, a feature and a threshold are selected which give a decomposition of the training set into two parts, such that one part contains all elements for which the feature value is smaller than the threshold, and the other part the remaining patterns. Then for each of the parts the process is repeated until the patterns in a part are assumed to be all of the same class. All the decisions made can be stored in a tree, hence the name.


Figure 5.5: Example of a hierarchical division of space based on a decision tree.

The relation between a decision tree and the corresponding division of feature space is illustrated in figure 5.5.

At classification time a pattern is taken and the value of the feature in the root node of the decision tree is compared to the threshold. If the value is smaller the left branch of the tree is followed, otherwise the right branch is followed. This is repeated until a leaf of the tree is reached, which corresponds to a single class.

All of the above models assume that the features of an object remain fixed. For data which has a temporal dimension this is often not the case; when time progresses the features might change. In such cases one needs to consider models which explicitly take the dynamic aspects into account.

The Hidden Markov Model is a suitable tool for describing such time-dependent patterns, which can in turn be used for classification. It bears a close relation to the Markov model considered in chapter 3.

• Hidden Markov Model (HMM): assigns a pattern to a class based on a sequential model of state and transition probabilities [79, 109];

Let us first describe the components which make up a Hidden Markov Model.

1. A set S = {s1, ..., sm} of possible states in which the system can be.

2. A set of symbols V which can be output when the system is in a certain state.

3. The state transition probabilities A, indicating the conditional probability that at time t+1 the system moves to state si if before it was in state sj: p(q_{t+1} = si | q_t = sj).

4. The probabilities that a certain symbol in V is output when the system is in a certain state.

5. Initial probabilities that the system starts in a certain state.

An example of an HMM is shown in figure 5.6.

There are two basic tasks related to the use of HMMs for which efficient methods exist in the literature:

1. Given a sequential pattern, how likely is it that it is generated by some given Hidden Markov Model?

2. Given a sequential pattern and a Hidden Markov Model, what is the most likely sequence of states the system went through?

To use HMMs for classification we first have to find models for each class. This is often done in a supervised way. From there, all the probabilities required are estimated using the training set.


Figure 5.6: Example of a Hidden Markov Model (five states connected by transition probabilities p12, p23, p34, p45, and p14).

Figure 5.7: An example of dimension reduction, from 2 dimensions to 1 dimension. (a) Two-dimensional datapoints. (b) Rotating the data so the vectors responsible for the most variation are perpendicular. (c) Reducing 2 dimensions to 1 dimension so that the dimension responsible for the most variation is preserved.

Now, to classify a pattern, task 1 mentioned above is used to find the most likely model and thus the most likely class. Task 2 can be used to take a sequence and label each part of the sequence with its most likely state.
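As an illustration of task 1, the forward algorithm computes the likelihood of an observation sequence under a given HMM; the model yielding the highest likelihood then determines the class. A minimal sketch with made-up probabilities:

    import numpy as np

    # toy HMM: 2 states, 2 output symbols; all numbers are made up for illustration
    initial = np.array([0.8, 0.2])                 # P(first state)
    transition = np.array([[0.7, 0.3],             # P(next state | current state)
                           [0.4, 0.6]])
    emission = np.array([[0.9, 0.1],               # P(symbol | state)
                         [0.2, 0.8]])

    def sequence_likelihood(observations):
        # forward algorithm: alpha[i] = P(observations so far, current state = i)
        alpha = initial * emission[:, observations[0]]
        for symbol in observations[1:]:
            alpha = (alpha @ transition) * emission[:, symbol]
        return alpha.sum()

    print(sequence_likelihood([0, 0, 1]))   # likelihood of the symbol sequence 0, 0, 1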

Many variations of HMMs exist. For example, in the above description discrete variables were used. One can also use continuous variables, leading to continuous observation density Markov models. Another extension is the product HMM, where first an HMM is trained for every individual feature, the results of which are then combined in a new HMM for integration.

5.4 Dimensionality reduction

In many practical pattern recognition problems the number of features is very high and hence the feature space is of a very high dimension. A 20-dimensional feature space is hard to imagine and visualize, but is not very large in pattern recognition. Luckily there is often redundant information in the space and one can reduce the number of dimensions to work with. One of the best known methods for this purpose is Principal Component Analysis (PCA).

Explaining the full details of PCA is beyond the scope of the lecture notes, so we explain its simplest form, reducing a 2D feature space to a 1D feature space. See figure 5.7.

The first step is to find the line which best fits the data. Then every datapoint is projected onto this line. Now, to build a classifier we only consider how the different categories can be distinguished along this 1D line, which is much simpler than the equivalent 2D problem. For higher dimensions the principle is the same, but instead of a line one can now also use the best fitting plane, which corresponds to using 2 of the principal components. In fact, the number of components in component analysis is equal to the dimension of the original space. By leaving out one or more of the least important components, one gets the principal components.
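A minimal sketch of PCA on toy 2D data via the singular value decomposition, keeping only the first principal component:

    import numpy as np

    rng = np.random.default_rng(0)
    # toy 2D data with most variation along one direction
    X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

    # center the data and find the principal directions via the SVD
    X_centered = X - X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # keep only the first principal component: project 2D points onto a 1D line
    X_1d = X_centered @ Vt[0]
    print(singular_values)      # first value is much larger than the second
    print(X_1d.shape)           # (100,)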

5.5 Invariance

Selecting the proper classifier is important in obtaining good results, but finding good features is even more important. This is a hard task as it depends on both the data and the classification task. A key notion in finding the proper features is invariance, defined as [128]:

A feature f of t is invariant under W if and only if f_t remains the same regardless of the unwanted condition expressed by W,

    t_1 \stackrel{W}{\sim} t_2 \implies f_{t_1} = f_{t_2}    (5.2)

To illustrate, consider again the simple example of bounding boxes of characters. If I want to say something about the relation between the width w and height h of the bounding box, I could consider computing the difference w − h, but if I scale the picture by a factor 2 the result is twice as large. If, however, I take the aspect ratio w/h, it is invariant under scaling.

In general, we observe that invariant features are related to the intrinsic properties of the element in the image or audio. An object in the image will not change if I use a different lamp, so a feature invariant to such changes in color would be good for classifying the object. On the other hand, the variant properties do have a great influence on how I perceive the image or sound. A loud piece of music might be far more annoying than hearing the same piece of music at normal level.

Keyterms in this chapter

Classification, training, testing, confusion matrix, principal component analysis, k-nearest neighbor, template matching, Bayes classifier, decision tree, Hidden Markov Model, invariance


Chapter 6

Video Indexing

(This chapter is adapted from [131].)

6.1 Introduction

For browsing, searching, and manipulating video documents, an index describing the video content is required. It forms the crux for applications like digital libraries storing multimedia data, or filtering systems [95] which automatically identify relevant video documents based on a user profile. To cater for these diverse applications, the indexes should be rich and as complete as possible.

Until now, construction of an index is mostly carried out by documentalists who manually assign a limited number of keywords to the video content. The specialist nature of the work makes manual indexing of video documents an expensive and time consuming task. Therefore, automatic classification of video content is necessary. This mechanism is referred to as video indexing and is defined as the process of automatically assigning content-based labels to video documents [48].

When assigning an index to a video document, three issues arise. The first is related to granularity and addresses the question: what to index, e.g. the entire document or single frames. The second issue is related to the modalities and their analysis and addresses the question: how to index, e.g. a statistical pattern classifier applied to the auditory content only. The third issue is related to the type of index one uses for labelling and addresses the question: which index, e.g. the names of the players in a soccer match, their time dependent position, or both.

Which elements to index clearly depends on the task at hand, but is for a large part also dictated by the capabilities of the automatic indexing methods, as well as by the amount of information that is already stored with the data at production time. As discussed in chapter 2, one of the most complex tasks is the interpretation of a recording of a produced video, as we have to reconstruct the layout and analyze the content. If, however, we are analyzing the edited video and all layout information as well as scripts are available, for example in MPEG-7 format (see chapter 10), the layout reconstruction and much of the indexing is not needed and one can focus on the remaining indexing tasks.

Most solutions to video indexing address the how question with a unimodal approach, using the visual [27, 45, 102, 134, 141, 165, 169], auditory [35, 43, 82, 99, 100, 104, 157], or textual modality [17, 52, 170]. Good books [40, 51] and review papers [15, 18] on these techniques have appeared in the literature. Instead of using one modality, multimodal video indexing strives to automatically classify (pieces of) a video document based on multimodal analysis. Only recently, approaches using combined multimodal analysis were reported [4, 8, 34, 55, 92, 105, 120] or commercially exploited, e.g. [28, 108, 151].



Figure 6.1: Data flow in unimodal video document segmentation (the video data is segmented into a reconstructed layout, detected people, detected objects, and a detected setting).


One review of multimodal video indexing is presented in [153]. The authors focus on approaches and algorithms available for processing of auditory and visual information to answer the how and what question. We extend this by adding the textual modality, and by relating the which question to multimodal analysis. Moreover, we put forward a unifying and multimodal framework. Our work should therefore be seen as an extension to the work of [15, 18, 153]. Combined they form a complete overview of the field of multimodal video indexing.

The multimodal video indexing framework is defined in section 2.5. This framework forms the basis for structuring the discussion on video document segmentation in section 6.2. In section 6.3 the role of conversion and integration in multimodal analysis is discussed. An overview of the index types that can be distinguished, together with some examples, will be given in section 6.4. Finally, in section 6.5 we end with a perspective on open research questions.

As indicated earlier we focus here on the indexing of a recording of produced and authored documents. Hence, we start off from the recorded datastream without making use of any descriptions that could have been added at production time. Given that this is the most elaborate task, many of the methods are also applicable in the other domains that we have considered in chapter 2.

6.2 Video document segmentation

For analysis purposes the process of authoring should be reversed. To that end, first a segmentation should be made that decomposes a video document into its layout and content elements. Results can be used for indexing specific segments. In many cases segmentation can be viewed as a classification problem, and hence pattern recognition techniques are appropriate. However, in the video indexing literature many heuristic methods are proposed. We will first discuss reconstruction of the layout for each of the modalities. Finally, we will focus on segmentation of the content. The data flow necessary for analysis is visualized in figure 6.1.


6.2.1 Layout reconstruction

Layout reconstruction is the task of detecting the sensor shots and transition edits in the video data. For analysis purposes layout reconstruction is indispensable. Since the layout guides the spectator in experiencing the video document, it should also steer analysis.

For reconstruction of the visual layout, several techniques already exist to segment a video document on the camera shot level, known as shot boundary detection (as an ironic legacy from early research on video parsing, this is also referred to as scene-change detection). Various algorithms are proposed in the video indexing literature to detect cuts in video documents, all of which rely on computing the dissimilarity of successive frames. The computation of dissimilarity can be at the pixel, edge, block, or frame level. Which one is appropriate largely depends on the kind of changes in content present in the video, whether the camera is moving, etc. The resulting dissimilarities as a function of time are compared with some fixed or dynamic threshold. If the dissimilarity is sufficiently high a cut is declared.
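A minimal sketch of such a cut detector, using the difference of gray value histograms of successive frames and a fixed threshold (the frame data and threshold are made up):

    import numpy as np

    def gray_histogram(frame, bins=64):
        # frame: 2D array of gray values in [0, 255]
        hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
        return hist / hist.sum()

    def detect_cuts(frames, threshold=0.4):
        # declare a cut when the histogram difference of successive frames is large
        cuts = []
        for i in range(1, len(frames)):
            h_prev = gray_histogram(frames[i - 1])
            h_curr = gray_histogram(frames[i])
            dissimilarity = 0.5 * np.abs(h_prev - h_curr).sum()   # in [0, 1]
            if dissimilarity > threshold:
                cuts.append(i)
        return cuts

    # two dark frames followed by two bright frames: a cut is declared at frame 2
    frames = [np.full((90, 120), 30), np.full((90, 120), 31),
              np.full((90, 120), 200), np.full((90, 120), 201)]
    print(detect_cuts(frames))   # [2]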

Block level features are popular as they can be derived from motion vectors, which can be computed directly from the visual channel when coded in MPEG, saving decompression time.

For an extensive overview of different cut detection methods we refer to the survey of Brunelli in [18] and the references therein.

Detection of transition edits in the visual modality can be done in several ways. Since the transition is gradual, comparison of successive frames is insufficient. The first researchers exploiting this observation were Zhang et al. [164]. They introduced the twin-comparison approach, using a dual threshold that accumulates significant differences to detect gradual transitions. For an extensive coverage of other methods we again refer to [18]; here we just summarize the methods mentioned. First, so called plateau detection uses every k-th frame. Another approach is based on effect modelling, where video production-based mathematical models are used to spot different edit effects using statistical classification. Finally, a third approach models the effect of a transition on intensity edges in subsequent frames.

Detection of abrupt cuts in the auditory layout can be achieved by detection of silences and transition points, i.e. locations where the category of the underlying signal changes. In the literature different methods are proposed for their detection.

In [99] it is shown that the average energy, E_n, is a sufficient measure for detecting silence segments. E_n is computed for a window, i.e. a set of n samples. If the averages for all the windows in a segment are found to be lower than a threshold, a silence is marked. Another approach is taken in [166]. Here E_n is combined with the zero-crossing rate (ZCR), where a zero-crossing is said to occur if successive samples have different signs. A segment is classified as silence if E_n is consistently lower than a set of thresholds, or if most ZCRs are below a threshold. This method also includes unnoticeable noise.
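A minimal sketch of the average energy and zero-crossing rate computations and a threshold-based silence test (window size and threshold are made-up values):

    import numpy as np

    def average_energy(window):
        # E_n: mean squared amplitude of the samples in the window
        return np.mean(window.astype(float) ** 2)

    def zero_crossing_rate(window):
        # fraction of successive sample pairs with different signs
        signs = np.sign(window)
        return np.mean(signs[:-1] != signs[1:])

    def is_silence(samples, window_size=400, energy_threshold=1e-4):
        windows = [samples[i:i + window_size]
                   for i in range(0, len(samples) - window_size + 1, window_size)]
        return all(average_energy(w) < energy_threshold for w in windows)

    t = np.linspace(0, 1, 8000)
    speechlike = 0.5 * np.sin(2 * np.pi * 220 * t)    # stand-in for a signal segment
    silence = 0.001 * np.random.randn(8000)           # near-silent segment
    print(is_silence(speechlike))                     # False
    print(is_silence(silence))                        # True
    print(zero_crossing_rate(speechlike[:400]))       # low for a pure tone, higher for noise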

Li et al. [73] use silence detection for separating the input audio segment into silence segments and signal segments. For the detection of silence periods they use a three-step procedure. First, raw boundaries between silence and signal are marked in the auditory data. In the succeeding two steps a fill-in process and a throwaway process are applied to the results. In the fill-in process short silence segments are relabelled signal and in the throwaway process low energy signal segments are relabelled silence.

Besides silence detection, [73] also detects transition points in the signal segments by using break detection and break merging. They compute an onset break, when a clear increase of signal energy is detected, and an offset break, when a clear decrease is found, to indicate a potential change in the category of the underlying signal. This is done by moving a window over the signal segment and comparing E_n of the two halves of the window at each sliding position. In the second step, adjacent breaks of the same type are merged into a single break.

In [166] music is distinguished from speech, silence, and environmental sounds based on features of the ZCR and the fundamental frequency. To assign the probability of being music to an audio segment, four features are used: the degree of being harmonic (based on the fundamental frequency), the degree to which the audio spectrum exhibits a clear peak during a period of time (an indication for the presence of a fundamental frequency), the variance of the ZCR, and the range of the amplitude of the ZCR.

The first step in reconstructing the textual layout is referred to as tokenization; in this phase the input text is divided into units called tokens or characters. Detection of text shots can be achieved in different ways, depending on the granularity used. If we are only interested in single words we can use the occurrence of white space as the main clue. However, this signal is not necessarily reliable, because of the occurrence of periods, single apostrophes and hyphenation [79]. When more context is taken into account one can reconstruct sentences from the textual layout. Detection of periods is a basic heuristic for the reconstruction of sentences: about 90% of periods are sentence boundary indicators [79]. Transitions are typically found by searching for predefined patterns.

Since layout is very modality dependent, a multimodal approach for its reconstruction won't be very effective. The task of layout reconstruction can currently be performed quite reliably. However, results might improve even further when more advanced techniques are used, for example methods exploiting the learning capabilities of statistical classifiers.

6.2.2 Content segmentation

In subsection 2.5.2 we introduced the elements of content. Here we will discuss how to detect them automatically, using different detection algorithms exploiting visual, auditory, and textual information sources.

People detection

Detection of people in video documents can be done in several ways. They can be detected in the visual modality by means of their faces or other body parts, in the auditory modality by the presence of speech, and in the textual modality by the appearance of names. In the following, those modality specific techniques will be discussed in more detail. For an in-depth coverage of the different techniques we refer to the cited references.

Most approaches using the visual modality simplify the problem of people detection to detection of a human face. Face detection techniques aim to identify all image regions which contain a face, regardless of its three-dimensional position, orientation, and the lighting conditions used, and if present return their image location and extents [161]. This detection is by no means trivial because of variability in location, orientation, scale, and pose. Furthermore, facial expressions, facial hair, glasses, make-up, occlusion, and lighting conditions are known to make detection error prone.

Over the years various methods for the detection of faces in images and image sequences have been reported; see [161] for a comprehensive and critical survey of current face detection methods. Of all methods currently available the one proposed by Rowley in [110] performs best [106]. The neural network-based system is able to detect about 90% of all upright and frontal faces, and more importantly the system only sporadically mistakes non-face areas for faces.


When a face is detected in a video, face recognition techniques aim to identify the person. A commonly used method for face recognition is matching by means of Eigenfaces [103]. In eigenface methods templates of size, let's say, 20x20 are used. For these example numbers this leads to a 20x20=400 dimensional space. Using Principal Component Analysis a subspace capturing the most relevant information is computed. Every component in itself is again a 20x20 template. This allows one to identify which information is most important in the matching process. A drawback of applying face recognition for video indexing is its limited generic applicability [120]. Reported results [9, 103, 120] show that face recognition works in constrained environments, preferably showing a frontal face close to the camera. When using face recognition techniques in a video indexing context one should account for this limited applicability.

In [85] people detection is taken one step further, detecting not only the head, but the whole human body. The algorithm presented first locates the constituent components of the human body by applying detectors for the head, legs, left arm, and right arm. Each individual detector is based on the Haar wavelet transform using specific examples. After ensuring that these components are present in the proper geometric configuration, a second example-based classifier combines the results of the component detectors to classify a pattern as either a person or a non-person.

A similar part-based approach is followed in [38] to detect naked people. First, large skin-colored components are found in an image by applying a skin filter that combines color and texture. Based on geometrical constraints between detected components an image is labelled as containing naked people or not. Obviously this method is suited for specific genres only.

The auditory channel also provides strong clues for the presence of people in video documents through speech in the segment. When layout segmentation has been performed, classification of the different signal segments as speech can be achieved based on the features computed. Again different approaches can be chosen.

In [166] five features are checked to distinguish speech from other auditory signals. The first is the relation between the amplitudes of the ZCR and energy curves. The second is the shape of the ZCR curve. The third and fourth features are the variance and the range of the amplitude of the ZCR curve. The fifth feature concerns the property of the fundamental frequency within a short time window. A decision value is defined for each feature. Based on these features, classification is performed using the weighted average of these decision values.

A more elaborate audio segmentation algorithm is proposed in [73]. The authors are able to segment not only speech but also speech mixed with noise or music, with an accuracy of about 90%. They compared different auditory feature sets, and conclude that temporal and spectral features perform poorly, as opposed to Mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC), which achieve a much better classification accuracy.

When a segment is labelled as speech, speaker recognition can be used to identify a person based on his or her speech utterance. Different techniques are proposed, e.g. [91, 100]. A generic speaker identification system consisting of three modules is presented in [100]. In the first module feature extraction is performed using a set of 14 MFCCs from each window. In the second module those features are used to classify each moving window using a nearest neighbor classifier. The classification is performed using a ground truth. In the third module the results of the moving windows are combined to generate a single decision for each segment. The authors report encouraging performance using speech segments of a feature film.

A strong textual cue for the appearance of people in a video document are words which are names. In [120], for example, natural language processing techniques using a dictionary, thesaurus, and parser are used to locate names in transcripts. The system calculates four different scores. The first measure is a grammatical score based on part-of-speech tagging to find candidate nouns. The second is a lexical score indicating whether a detected noun is related to persons. The situational score is the third score, giving an indication whether the word is related to social activities involving people. Finally, the positional score for each word in the transcripts measures where in the text of the newsreader the word is mentioned. A net likelihood score is then calculated which, together with the name candidate and segment information, forms the system's output. Related to this problem is the task of named entity recognition, which is known from the field of computational linguistics. Here one seeks to classify every word in a document into one of eight categories: person, location, organization, date, time, percentage, monetary value, or none of the above [11]. In the reference, name recognition is viewed as a classification problem, where every word is either part of some name, or not. The authors use a variant of an HMM for the name recognition task based on a bi-gram language model. Compared to any other reported learning algorithm, their name recognition results are consistently better.

In conclusion, people detection in video can be achieved using different approaches, all having limitations. Variance in orientation and pose, together with occlusion, makes visual detection error prone. Speech detection and recognition is still sensitive to noise and environmental sounds. Also, more research on the detection of names in text is needed to improve results. As the errors in the different modalities are not necessarily correlated, a multimodal approach to the detection of persons in video documents can be an improvement. Besides improved detection, fusion of different modalities is interesting with respect to recognition of specific persons.

Object detection

Object detection forms a generalization of the problem of people detection. Specific objects can be detected by means of specialized visual detectors, motion, sounds, and appearance in the textual modality. Object detection methods for the different modalities will be highlighted here.

Approaches for object detection based on visual appearance can range from detection of specific objects to detection of more general objects. An example of the former is given in [122], where the presence of passenger cars in image frames is detected by using multiple histograms. Each histogram represents the joint statistics of a subset of wavelet coefficients and their position on the object. The authors use statistical modelling to account for variation, which enables them to reliably detect passenger cars over a wide range of points of view.

In the above, we know what we are looking for and the number of classes is small, so one can perform strong segmentation. If not, grouping based on motion, i.e. weak segmentation, is the best in the absence of other knowledge. Moreover, since the appearance of objects might vary widely, rigid object motion detection is often the most valuable feature. Thus, when considering an approach for general object detection, motion is a useful feature. A typical method to detect moving objects of interest starts with a segmentation of the image frame. Regions in the image frame sharing similar motion are merged in the second stage. The result is a motion-based segmentation of the video. In [93] a method is presented that segments a single video frame into independently moving visual objects. The method follows a bottom-up approach, starting with a color-based decomposition of the frame. Regions are then merged based on their motion parameters via a statistical test, resulting in superior performance over other methods, e.g. [6, 159].

Specific objects can also be detected by analyzing the auditory layout segmentation of the video document. Typically, segments in the layout segmentation first need to be classified as environmental sounds. Subsequently, the environmental sounds are further analyzed for the presence of specific object sound patterns. In [157, 166] for example, specific object sound patterns, e.g. dog barks, ringing telephones, and different musical instruments, are detected by selecting the appropriate auditory features.

Detecting objects in the textual modality also remains a challenging task. A logical intermediate step in detecting objects of interest in the textual modality is part-of-speech tagging. Though limited, the information we get from tagging is still quite useful. By extracting and analyzing the nouns in tagged text, for example, and applying chunking [1], one can make some assumptions about the objects present. To our knowledge chunking has not yet been used in combination with the detection of objects in video documents. Its application, however, might prove to be a valuable extension to unimodal object detection.

Successful detection of objects is limited to specific examples. A generic object detector still forms the holy grail in video document analysis. Therefore, multimodal object detection seems interesting. It helps if objects of interest can be identified within different modalities. Then the specific visual appearance, the specific sound, and its mention in the accompanying textual data can together yield the evidence for robust recognition.

Setting detection

For the detection of setting, motion is not so relevant, as the setting is usually static. Therefore, techniques from the field of content-based image retrieval can be used. See [129] for a complete overview of this field. By using for example key frames, those techniques can easily be used for video indexing. We focus here on methods that assign a setting label to the data, based on analysis of the visual, auditory, or textual modality.

In [137] images are classified as either indoor or outdoor, using three types of visual features: color, texture, and frequency information. Instead of computing features on the entire image, the authors use a multi-stage classification approach. First, sub-blocks are classified independently, and afterwards another classification is performed using the k-nearest neighbor classifier.

Outdoor images are further classified into city and landscape images in [145]. Features used are color histograms, color coherence vectors, Discrete Cosine Transform (DCT) coefficients, edge direction histograms, and edge direction coherence vectors. Classification is done with a weighted k-nearest neighbor classifier with the leave-one-out method. Reported results indicate that the edge direction coherence vector has good discriminatory power for city vs. landscape. Furthermore, it was found that color can be an important cue in classifying natural landscape images into forest, mountain, or sunset/sunrise classes. By analyzing sub-blocks, the authors detect the presence of sky and vegetation in outdoor image frames in another paper. Each sub-block is independently classified, using a Bayesian classification framework, as sky vs. non-sky or vegetation vs. non-vegetation based on color, texture, and position features [144].

Detecting the setting based on auditory information can be achieved by detecting specific environmental sound patterns. In [157] the authors reduce an auditory segment to a small set of parameters using various auditory features, namely loudness, pitch, brightness, bandwidth, and harmonicity. By using statistical techniques over the parameter space the authors accomplish classification and retrieval of several sound patterns including laughter, crowds, and water. In [166] classes of natural and synthetic sound patterns are distinguished by using an HMM, based on timbre and rhythm. The authors are capable of classifying different environmental setting sound patterns, including applause, explosions, rain, river flow, thunder, and windstorm.

The transcript is used in [25] to extract geographic reference information for the video document.


Figure 6.2: Role of conversion and integration in multimodal video document analysis.

The authors match named places to their spatial coordinates. The process begins by using the text metadata as the source material to be processed. A known set of places along with their spatial coordinates, i.e. a gazetteer, is created to resolve geographic references. The gazetteer used consists of approximately 300 countries, states and administrative entities, and 17000 major cities worldwide. After post processing steps, e.g. including related terms and removing stop words, the end result is a set of segments in a video sequence indexed with latitude and longitude.

We conclude that the visual and auditory modality are well suited for recognition of the environment in which the video document is situated. By using the textual modality, a more precise (geographic) location can be extracted. Fusion of the different modalities may provide the video document with semantically interesting setting terms such as: outside vegetation in Brazil near a flowing river, which can never be derived from one of the modalities in isolation.

6.3 Multimodal analysis

After reconstruction of the layout and content elements, the next step in the inverse analysis process is analysis of the layout and content to extract the semantic index. At this point the modalities should be integrated. However, before analysis, it might be useful to apply modality conversion of some elements into a more appropriate form. The role of conversion and integration in multimodal video document analysis will be discussed in this section, and is illustrated in figure 6.2.

6.3.1 Conversion

For analysis, conversion of elements of the visual and auditory modalities to text is most appropriate.

A typical component we want to convert from the visual modality is overlayed text. Video Optical Character Recognition (OCR) methods for the detection of text in video frames can be divided into component-based, e.g. [125], or texture-based methods, e.g. [74]. A method utilizing the DCT coefficients of compressed video was proposed in [168]. By using Video OCR methods, the visual overlayed text object can be converted into a textual format. The quality of the results of Video OCR varies, depending on the kind of characters used, their color, their stability over time, and the quality of the video itself.

From the auditory modality one typically wants to convert the uttered speech into transcripts. Available speech recognition systems are known to be mature for applications with a single speaker and a limited vocabulary. However, their performance degrades when they are used in real world applications instead of a lab environment [18]. This is especially caused by the sensitivity of the acoustic model to different microphones and different environmental conditions. Since conversion of speech into transcripts still seems problematic, integration with other modalities might prove beneficial.

Note that other conversions are possible, e.g. computer animation can be viewed as converting text to video. However, these are relevant for presentation purposes only.

6.3.2 Integration

The purpose of integration of multimodal layout and content elements is to improve classification performance. To that end the addition of modalities may serve as a verification method, a method compensating for inaccuracies, or as an additional information source.

An important aspect, indispensable for integration, is synchronization and alignment of the different modalities, as all modalities must have a common timeline. Typically the time stamp is used. We observe that in literature modalities are converted to a format conforming to the researcher's main expertise. When audio is the main expertise, image frames are converted to (milli)seconds, e.g. [55]. In [4, 34] image processing is the main expertise, and audio samples are assigned to image frames or camera shots. When a time stamp isn't available, a more advanced alignment procedure is necessary. Such a procedure is proposed in [60]. The error prone output of a speech recognizer is compared and aligned with the accompanying closed captions of news broadcasts. The method first finds matching sequences of words in the transcript and closed caption by performing a dynamic-programming3 based alignment between the two text strings. Segments are then selected when sequences of three or more words are similar in both resources.
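The word-sequence matching step of [60] can be sketched with Python's standard difflib; the SequenceMatcher used here is a stand-in for the dynamic-programming alignment described above, and the function and variable names are illustrative only.

from difflib import SequenceMatcher

def aligned_segments(asr_text, caption_text, min_run=3):
    """Find runs of >= min_run identical words shared by the (error prone)
    speech recognizer output and the closed captions."""
    asr_words = asr_text.lower().split()
    cc_words = caption_text.lower().split()
    matcher = SequenceMatcher(None, asr_words, cc_words, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            segments.append({
                'asr_start': block.a,    # word offset in the transcript
                'cc_start': block.b,     # word offset in the captions
                'words': asr_words[block.a:block.a + block.size],
            })
    return segments

asr = "the president said today that taxes will rise next year"
cc  = "president said today taxes will rise next year analysts report"
print(aligned_segments(asr, cc))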

To achieve the goal of multimodal integration, several approaches can be followed. We categorize those approaches by their distinctive properties with respect to the processing cycle, the content segmentation, and the classification method used. The processing cycle of the integration method can be iterated, allowing for incremental use of context, or non-iterated. The content segmentation can be performed by using the different modalities in a symmetric, i.e. simultaneous, or asymmetric, i.e. ordered, fashion. Finally, for the classification one can choose between a statistical or knowledge-based approach. An overview of the different integration methods found in literature is given in table 6.1.

Most integration methods reported are symmetric and non-iterated. Some follow a knowledge-based approach for classification of the data into classes of the semantic index hierarchy [37, 90, 105, 119, 142]. In [142] for example, the auditory and visual modality are integrated to detect speech, silence, speaker identities, and no face shot / face shot / talking face shot using knowledge-based rules. First, talking people are detected by detecting faces in the camera shots; subsequently a knowledge-based measure is evaluated based on the amount of speech in the shot.

Many methods in literature follow a statistical approach [4, 22, 33, 34, 55, 60, 61, 92, 120, 155].

3Dynamic programming is a programming technique in which intermediate results in some iterative process are stored so they can be reused in later iterations, rather than recomputed.


Table 6.1: An overview of different integration methods. Each method is marked along three dimensions: content segmentation (symmetric or asymmetric), classification method (statistical or knowledge-based), and processing cycle (iterated or non-iterated). Methods covered: [4], [8], [22], [33], [34], [37], [55] (two approaches), [60], [61], [90], [92], [105], [119], [120], [132], [142], [155].

An example of a symmetric, non-iterated statistical integration method is the Name-It system presented in [120]. The system associates detected faces and names by calculating a co-occurrence factor that combines the analysis results of face detection and recognition, name extraction, and caption recognition. A high co-occurrence factor indicates that a certain visual face template is often associated with a certain name, in either the caption in the image or the associated text; hence a relation between face and name can be concluded.
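A minimal sketch of the co-occurrence idea, assuming face recognition and name extraction have already produced, per shot, a set of face identifiers and a set of names; the Jaccard-style normalization is one plausible choice, not necessarily the exact factor used in [120].

from collections import Counter
from itertools import product

def cooccurrence_factors(shots):
    """shots: list of (faces, names) pairs, e.g. [({'face_1'}, {'clinton'}), ...].
    Returns a dict mapping (face, name) to a normalized co-occurrence score."""
    pair_counts = Counter()
    face_counts = Counter()
    name_counts = Counter()
    for faces, names in shots:
        face_counts.update(faces)
        name_counts.update(names)
        pair_counts.update(product(faces, names))
    # co-occurrences normalized by how often the face and the name occur at all
    return {
        (f, n): c / (face_counts[f] + name_counts[n] - c)
        for (f, n), c in pair_counts.items()
    }

shots = [({'face_1'}, {'clinton'}), ({'face_1'}, {'clinton', 'gore'}), ({'face_2'}, {'gore'})]
print(cooccurrence_factors(shots))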

Hidden Markov Models are frequently used as a statistical classification method for multimodal integration [4, 33, 34, 55]. A clear advantage of this framework is that it is not only capable of integrating multimodal features, but is also capable of including sequential features. Moreover, an HMM can also be used as a classifier combination method.

When modalities are independent, they can easily be included in a product HMM. In [55] such a classifier is used to train two modalities separately, which are then combined symmetrically by computing the product of the observation probabilities. It is shown that this results in significant improvement over a unimodal approach.
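The product rule itself is straightforward; a sketch, assuming each modality's HMM has already produced a per-class log-likelihood for the observed segment (the example scores are made up):

def classify_product(audio_loglik, visual_loglik):
    """Combine two independently trained per-class models by the product rule.

    audio_loglik, visual_loglik : dicts mapping class name -> log P(observations | class).
    Multiplying probabilities is adding log-likelihoods; the class with the
    highest combined score wins."""
    combined = {c: audio_loglik[c] + visual_loglik[c] for c in audio_loglik}
    return max(combined, key=combined.get), combined

# hypothetical scores from an audio HMM and a visual HMM for one video segment
audio = {'news': -120.4, 'commercial': -118.9, 'sports': -131.2}
visual = {'news': -87.1, 'commercial': -90.5, 'sports': -95.0}
print(classify_product(audio, visual))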

In contrast to the product HMM method, a neural network-based approach doesn't assume features are independent. The approach presented in [55] trains an HMM for each modality and category. A three-layer perceptron is then used to combine the outputs from each HMM in a symmetric and non-iterated fashion.

Another advanced statistical classifier for multimodal integration was recently proposed in [92]. A probabilistic framework for semantic indexing of video documents based on so-called multijects and multinets is presented. The multijects model content elements, which are integrated in the multinets to model the relations between objects, allowing for symmetric use of modalities. For the integration in the multinet the authors propose a Bayesian belief network [101], which is a probabilistic description of the relation between different variables. Significant improvements of detection performance are demonstrated. Moreover, the framework supports detection based on iteration. Viability of the Bayesian network as a symmetric integrating classifier was also demonstrated in [61]; however, that method doesn't support iteration.

In contrast to the above symmetric methods, an asymmetric approach is presented in [55]. A two-stage HMM is proposed which first separates the input video document into three broad categories based on the auditory modality; in the second stage another HMM is used to split those categories based on the visual modality.


A drawback of this method is its application dependency, which may result in less effectiveness in other classification tasks.

An asymmetric knowledge-based integration method, supporting iteration, was proposed in [8]. First, the visual and textual modality are combined to generate semantic index results. Those form the input for a post-processing stage that uses those indexes to search the visual modality for the specific time of occurrence of the semantic event.

For exploration of other integration methods, we again take a look in the field of content-based image retrieval. From this field methods are known that integrate the visual and textual modality by combining images with associated captions or HTML tags. Early reported methods used a knowledge base for integration, e.g. the Piction system [132]. This system uses modalities asymmetrically: it first analyzes the caption to identify the expected number of faces and their expected relative positions. Then a face detector is applied to a restricted part of the image; if no faces are detected an iteration step is performed that relaxes the thresholds. More recently, Latent Semantic Indexing (LSI) [31] has become a popular means for integration [22, 155]. LSI is symmetric and non-iterated and works by statistically associating related words to the conceptual context of the given document. In effect it relates documents that use similar terms, which for images are related to features in the image. Thus, it has a strong relation to co-occurrence based methods. In [22] LSI is used to capture text statistics in vector form from an HTML document. Words with specific HTML tags are given higher weights. In addition, the position of the words with respect to the position of the image in the document is also accounted for. The image features, that is the color histogram and the dominant orientation histogram, are also captured in vector form; combined they form a unified vector that the authors use for content-based search of a WWW-based image database. Reported experiments show that maximum improvement was achieved when both visual and textual information are employed.

In conclusion, video indexing results improve when a multimodal approach is followed, not only because of enhancement of content findings, but also because more information is available. Most methods integrate in a symmetric and non-iterated fashion. Usage of incremental context by means of iteration can be a valuable addition to the success of the integration process. Usage of combined statistical classifiers in multimodal video indexing literature is still scarce, though various successful statistical methods for classifier combination are known, e.g. bagging, boosting, or stacking [57]. Results can thus probably be improved even more substantially when advanced classification methods from the field of statistical pattern recognition, or other disciplines, are used, preferably in an iterated fashion.

6.4 Semantic video indexes

The methodologies described in section 6.3 have been applied to extract a variety of the different video indexes described in subsection 2.5.1. In this section we systematically report on the different indexes and the information from which they are derived. As methods for extraction of purpose are not mentioned in literature, this level is excluded. Figure 6.3 presents an overview of all indexes and the methods in literature which can derive them.

6.4.1 Genre

“Editing is an important stylistic element because it affects the overall rhythm of the video document” [12]. Hence, layout related statistics are well suited for indexing a video document into a specific genre. The most obvious element of this editorial style is the average shot length. Generally, the longer the shots, the slower the rhythm of the video document.

The rate of shot changes together with the presence of black frames is used in [53] to detect commercials within news broadcasts. The rationale behind detection of black frames is that they are often broadcasted for a fraction of a second before, after, and between commercials. However, black frames can also occur for other reasons. Therefore, the authors use the observation that advertisers try to make commercials more interesting by rapidly cutting between different shots, resulting in a higher shot change rate. A similar approach is followed in [75] for detecting commercials within broadcasted feature films. Besides the detection of monochrome frames and shot change rate, the authors use the edge change ratio and motion vector length to capture high action in commercials.
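A rough sketch of such a rule, assuming shot boundaries and per-frame mean luminance are already available; the window length and thresholds are illustrative and not the values used in [53] or [75].

import numpy as np

def commercial_like(shot_boundaries, frame_luminance, fps=25.0,
                    window_s=60.0, min_cut_rate=0.5, black_thresh=20):
    """Flag one-minute windows that look like commercial blocks.

    shot_boundaries : sorted frame indices of detected shot changes
    frame_luminance : per-frame mean luminance (0-255)
    A window is flagged when its shot-change rate (cuts per second) is high
    and it contains at least one run of near-black frames."""
    n_frames = len(frame_luminance)
    window = int(window_s * fps)
    cuts = np.zeros(n_frames, dtype=bool)
    cuts[np.asarray(shot_boundaries, dtype=int)] = True
    black = np.asarray(frame_luminance) < black_thresh
    flagged = []
    for start in range(0, n_frames - window + 1, window):
        rate = cuts[start:start + window].sum() / window_s
        if rate >= min_cut_rate and black[start:start + window].any():
            flagged.append((start, start + window))
    return flagged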

Average shot length, the percentage of different types of edit transitions, and six visual content features are used in [141] to classify a video document into the cartoon, commercial, music, news, and sports video genres. As a classifier a specific decision tree called C4.5 is used [67], which can work both on real and symbolic values.

In [33] the observation is made that different genres exhibit different temporal patterns of face locations. They furthermore observe that the temporal behavior of overlaid text is genre dependent. In fact the following genre dependent functions can be identified:

• News: annotation of people, objects, setting, and named events;

• Sports: player identification, game related statistics;

• Movies/TV series: credits, captions, and language translations;

• Commercials: product name, claims, and disclaimers.

Based on results of face and text tracking, each frame is assigned one of 15 labels, describing variations on the number of appearing faces and/or text lines together with the distance of a face to the camera. These labels form the input for an HMM, which classifies an input video document into news, commercials, sitcoms, and soaps based on maximum likelihood.

Detection of generic sport video documents seems almost impossible due to the large variety in sports. In [66], however, a method is presented that is capable of identifying mainstream sports videos. Discriminating properties of sport videos are the presence of slow-motion replays, large amounts of overlayed text, and specific camera/object motion. The authors propose a set of eleven features to capture these properties, and obtain 93% accuracy using a decision tree classifier. Analysis showed that motion magnitude and direction of motion features yielded the best results.

Methods for indexing video documents into a specific genre using a multimodal approach are reported in [37, 55, 61]. In [55] news reports, weather forecasts, commercials, basketball, and football games are distinguished based on audio and visual information. The authors compare different integration methods and classifiers and conclude that a product HMM classifier is most suited for their task, see also subsection 6.3.2.

The same modalities are used in [37]. The authors present a three-step approach. In the first phase, content features such as color statistics, motion vectors, and audio statistics are extracted. Secondly, layout features are derived, e.g. shot lengths, camera motion, and speech vs. music. Finally, a style profile is composed and an educated guess is made as to the genre to which a shot belongs. They report promising results by combining different layout and content attributes of video for analysis, and can find five (sub)genres, namely news broadcasts, car racing, tennis, commercials, and animated cartoons.


Besides auditory and visual information, [61] also exploits the textual modality. The segmentation and indexing approach presented uses three layers to process low-, mid-, and high-level information. At the lowest level features such as color, shape, MFCC, ZCR, and the transcript are extracted. Those are used at the mid-level to detect faces, speech, keywords, etc. At the highest level the semantic index is extracted through the integration of mid-level features across the different modalities, using Bayesian networks, as noted in subsection 6.3.2. In its current implementation the presented system classifies segments as either part of a talk show, commercial, or financial news.

6.4.2 Sub-genre

Research on indexing sub-genres, or specific instances of a genre, has been geared mainly towards sport videos [37, 55, 114] and commercials [27]. Obviously, future index techniques may also extract other sub-genres, for example westerns, comedies, or thrillers within the feature film genre.

Four sub-genres of sport video documents are identified in [114]: basketball, ice hockey, soccer, and volleyball. The full motion fields in consecutive frames are used as a feature. To reduce the feature space, Principal Component Analysis is used. For classification two different statistical classifiers were applied. It was found that a continuous observation density Markov model gave the best results. The sequences analyzed were post-edited to contain only the play of the sports, which is a drawback of the presented system. For instance, no crowd scenes or time outs were included. Some sub-genres of sport video documents are also detected in [37, 55], as noted in section 6.4.1.

An approach to index commercial videos based on semiotic and semantic properties is presented in [27]. The general field of semiotics is the study of signs and symbols, what they mean and how they are used. For indexing of commercials the semiotic approach classifies commercials into four different sub-genres that relate to the narrative of the commercial. The following four sub-genres are distinguished: practical, critical, utopic, and playful commercials. Perceptual features, e.g. saturated colors, horizontal lines, and the presence or absence of recurring colors, are mapped onto the semiotic categories. Based on research in the marketing field, the authors also formalized a link between editing, color, and motion effects on the one hand, and feelings that the video arouses in the observer on the other. Characteristics of a commercial are related to those feelings and have been organized in a hierarchical fashion. A main classification is introduced between commercials that induce feelings of action and those that induce feelings of quietness. The authors subdivide action further into suspense and excitement. Quietness is further specified in relaxation and happiness.

6.4.3 Logical units

Detection of logical units in video documents is extensively researched with respect to the detection of scenes or Logical Story Units (LSU) in feature films and sitcoms. An overview and evaluation of such methods is presented in [148]. A summary of that paper follows. After that we consider how to give the LSU a proper label.

Logical story unit detection

In cinematography an LSU is defined as a series of shots that communicate a unified action with a common locale and time [13]. Viewers perceive the meaning of a video at the level of LSUs [14, 113].


A problem for LSU segmentation using visual similarity is that it seems to conflict with its definition based on the semantic notion of common locale and time. There is no one-to-one mapping between the semantic concepts and the data-driven visual similarity. In practice, however, most LSU boundaries coincide with a change of locale, causing a change in the visual content of the shots. Furthermore, usually the scenery in which an LSU takes place does not change significantly, or foreground objects will appear in several shots, e.g. talking heads in the case of a dialogue. Therefore, visual similarity provides a proper base for common locale.

There are two complicating factors regarding the use of visual similarity. Firstly, not all shots in an LSU need to be visually similar. For example, one can have a sudden close-up of a glass of wine in the middle of a dinner conversation showing talking heads. This problem is addressed by the overlapping links approach [50], which assigns visually dissimilar shots to an LSU based on temporal constraints. Secondly, at a later point in the video, the time and locale of one LSU can be repeated in another, not immediately succeeding LSU.

The two complicating factors apply to the entire field of LSU segmentation based on visual similarity. Consequently, an LSU segmentation method using visual similarity depends on the following three assumptions:

Assumption 1 The visual content in an LSU is dissimilar from the visual content in a succeeding LSU.

Assumption 2 Within an LSU, shots with similar visual content are repeated.

Assumption 3 If two shots σx and σy are visually similar and assigned to the same LSU, then all shots between σx and σy are part of this LSU.

For parts of a video where the assumptions are not met, segmentation results will be unpredictable.

Given the assumptions, LSU segmentation methods using visual similarity can be characterized by two important components, viz. the shot distance measurement and the comparison method. The former determines the (dis)similarity mentioned in assumptions 1 and 2. The latter component determines which shots are compared in finding LSU boundaries. Both components are described in more detail below.

Shot distance measurement. The shot distance δ represents the dissimilarity between two shots and is measured by combining (typically multiplying) measurements for the visual distance δv and the temporal distance δt. The two distances will now be explained in detail.

Visual distance measurement consists of a dissimilarity function δv_f for a visual feature f measuring the distance between two shots. Usually a threshold τv_f is used to determine whether two shots are close or not. δv_f and τv_f have to be chosen such that the distance between shots in an LSU is small (assumption 2), while the distance between shots in different LSUs is large (assumption 1).

Segmentation methods in literature do not depend on specific features or dissimilarity functions, i.e. the features and dissimilarity functions are interchangeable amongst methods.

Temporal distance measurement consists of a temporal distance function δt. As observed before, shots from not immediately succeeding LSUs can have similar content. Therefore, it is necessary to define a time window τt, determining which shots in a video are available for comparison. The value for τt, expressed in shots or frames, has to be chosen such that it resembles the length of an LSU. In practice, the value has to be estimated since LSUs vary in length.

Function δt is either binary or continuous. A binary δt results in 1 if two shots are less than τt shots or frames apart and ∞ otherwise [162]. A continuous δt reflects the distance between two shots more precisely. In [113], δt ranges from 0 to 1. As a consequence, the further two shots are apart in time, the closer the visual distance has to be to assign them to the same LSU. Time window τt is still used to mark the point after which shots are considered dissimilar. Shot distance is then set to ∞ regardless of the visual distance.
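The combined shot distance can be written down directly; a sketch with a binary temporal distance, assuming each shot is represented by a normalized color histogram (the feature, dissimilarity function, and thresholds are interchangeable, as noted above):

import numpy as np

def shot_distance(hist_x, hist_y, idx_x, idx_y, tau_t=20):
    """delta = delta_v * delta_t for two shots.

    hist_x, hist_y : normalized color histograms of the two shots (the visual feature f)
    idx_x, idx_y   : shot indices in the video
    tau_t          : time window in shots (binary temporal distance)"""
    delta_v = 0.5 * np.abs(hist_x - hist_y).sum()     # L1 histogram distance in [0, 1]
    delta_t = 1.0 if abs(idx_x - idx_y) <= tau_t else np.inf
    return delta_v * delta_t

def same_lsu_candidate(hist_x, hist_y, idx_x, idx_y, tau_v=0.3, tau_t=20):
    """Shots are candidates for the same LSU when their combined distance is small."""
    return shot_distance(hist_x, hist_y, idx_x, idx_y, tau_t) < tau_v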

The comparison method is the second important component of LSU segmentation methods. In sequential iteration, the distance between a shot and other shots is measured pair-wise. In clustering, shots are compared group-wise. Note that in the sequential approach still many comparisons can be made, but always of one pair of shots at a time.

Methods from literature can now be classified according to the framework. The visual distance function is not discriminatory, since it is interchangeable amongst methods. Therefore, the two discriminating dimensions for classification of methods are the temporal distance function and the comparison method. Their names in literature and references to methods are given in table 6.2. Note that in [65] two approaches are presented.

Table 6.2: Classification of LSU segmentation methods.

Comparison method | Binary temporal distance                        | Continuous temporal distance
sequential        | overlapping links [50], [68], [71], [3], [24]   | continuous video coherence [65], [135], [77]
clustering        | time constrained clustering [162], [76], [115]  | time adaptive grouping [113], [65], [150]

Labelling logical units

Detection of LSU boundaries alone is not enough. For indexing, we are especially interested in its accompanying label.

A method that is capable of detecting dialogue scenes in movies and sitcoms is presented in [4]. Based on audio analysis, face detection, and face location analysis the authors generate output labels which form the input for an HMM. The HMM outputs a scene labeled as either establishing scene, transitional scene, or dialogue scene. According to the results presented, combined audio and face information gives the most consistent performance over different observation sets and training data. However, in its current design, the method is incapable of differentiating between dialogue and monologue scenes.

A technique to characterize and index violent scenes in general TV drama and movies is presented in [90]. The authors integrate cues from both the visual and auditory modality symmetrically. First, a measure of activity for each video shot is computed as a measure of action. This is combined with detection of flames and blood using a predefined color table. The corresponding audio information provides supplemental evidence for the identification of violent scenes. The focus is on the abrupt change in energy level of the audio signal, computed using the energy entropy criterion. As a classifier the authors use a knowledge-based combination of feature values at scene level.

By utilizing a symmetric and non-iterated multimodal integration method, four different types of scenes are identified in [119]. The audio signal is segmented into silence, speech, music, and miscellaneous sounds. This is combined with a visual similarity measure, computed within a temporal window. Dialogues are then detected based on the occurrence of speech and an alternating pattern of visual labels, indicating a change of speaker. When the visual pattern exhibits a repetition the scene is labeled as story. When the audio signal isn't labeled as speech, and the visual information exhibits a sequence of visually non-related shots, the scene is labeled as action. Finally, scenes that don't fit in the aforementioned categories are indexed as generic scenes.

In contrast to [119], a unimodal approach based on the visual information source is used in [163] to detect dialogues, actions, and story units. Shots that are visually similar and temporally close to each other are assigned the same (arbitrary) label. Based on the patterns of labels in a scene, it is indexed as either dialogue, action, or story unit.

A scheme for reliably identifying logical units which clusters sensor shots according to detected dialogues, similar settings, or similar audio is presented in [105]. The method starts by calculating specific features for each camera and microphone shot. Auditory, color, and orientation features are supported, as well as face detection. Next, a Euclidean metric is used to determine the distance between shots with respect to the features, resulting in a so-called distance table. Based on the distance tables, shots are merged into logical units using absolute and adaptive thresholds.

News broadcasts are far more structured than feature films. Researchers have exploited this to classify logical units in news video using a model-based approach. Especially anchor shots, i.e. shots in which the newsreader is present, are easy to model and therefore easy to detect. Since there is only minor body movement they can be detected by comparison of the average difference between (regions in) successive frames. This difference will be minimal. This observation is used in [45, 124, 165]. In [45, 124] the restricted position and size of detected faces is also used.

Another approach for the detection of anchor shots is taken in [10, 49, 56]. Repetition of visually similar anchor shots throughout the news broadcast is exploited. To refine the classification of the similarity measure used, [10] requires anchor shot candidates to have a motion quantity below a certain threshold. Each shot is classified as either anchor or report. Moreover, textual descriptors are added based on extracted captions and recognized speech. To classify report and anchor shots, the authors in [56] use face and lip movement detection. To distinguish anchor shots, the aforementioned classification is extended with the knowledge that anchor shots are graphically similar and occur frequently in a news broadcast. The largest cluster of similar shots is therefore assigned to the class of anchor shots. Moreover, the detection of a title caption is used to detect anchor shots that introduce a new topic. In [49] anchor shots are detected together with silence intervals to indicate report boundaries. Based on a topics database the presented system finds the most probable topic per report by analyzing the transcribed speech. As opposed to [10, 56], final descriptions are not added to shots, but to a sequence of shots that constitutes a complete report on one topic. This is achieved by merging consecutive segments with the same topic in their list of most probable topics.

Besides the detection of anchor persons and reports, other logical units can be identified. In [34] six main logical units for TV broadcast news are distinguished, namely begin, end, anchor, interview, report, and weather forecast. Each logical unit is represented by an HMM. For each frame of the video one feature vector is calculated consisting of 25 features, including motion and audio features. The resulting feature vector sequence is assigned to a logical unit based on the sequence of HMMs that maximizes the probability of having generated this feature vector sequence. By using this approach, parsing and indexing of the video is performed in one pass through the video only.

Other examples of highly structured TV broadcasts are talk and game shows. In [62] a method is presented that detects guest and host shots in those video documents. The basic observation used is that in most talk shows the same person is host for the duration of the program but guests keep on changing. Also, host shots are typically shorter since only the host asks questions. For a given show, the key frames of the N shortest shots containing one detected face are correlated in time to find the shot most often repeated. The key host frame is then compared against all key frames to detect all similar host shots, and guest shots.

In [160] a model for segmenting soccer video into the logical units break and play is given. A grass-color ratio is used to classify frames into three views according to video shooting scale, namely global, zoom-in, and close-up. Based on segmentation rules, the different views are mapped to the two logical units. Global views are classified as play and close-ups as breaks if they have a minimum length. Otherwise a neighborhood voting heuristic is used for classification.
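A sketch of the grass-color-ratio idea, assuming frames are available as RGB arrays; the color test and the view thresholds are illustrative, not those of [160].

import numpy as np

def grass_ratio(frame_rgb):
    """Fraction of pixels whose color is dominated by green (a crude grass test)."""
    r = frame_rgb[..., 0].astype(int)
    g = frame_rgb[..., 1].astype(int)
    b = frame_rgb[..., 2].astype(int)
    grass = (g > 90) & (g > r + 20) & (g > b + 20)
    return grass.mean()

def view_label(frame_rgb, hi=0.5, lo=0.1):
    """Map the grass ratio to the three shooting scales used in [160]."""
    ratio = grass_ratio(frame_rgb)
    if ratio > hi:
        return 'global'      # mostly field visible, candidate 'play'
    if ratio > lo:
        return 'zoom-in'
    return 'close-up'        # hardly any field visible, candidate 'break'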

6.4.4 Named events

Named events are at the lowest level in the semantic index hierarchy. For their detection different techniques have been used.

A three-level event detection algorithm is presented in [47]. The first level of the algorithm extracts generic color, texture, and motion features, and detects spatio-temporal objects. The mid-level employs a domain dependent neural network to verify whether the detected objects belong to conceptual objects of interest. The generated shot descriptors are then used by a domain-specific inference process at the third level to detect the video segments that contain events of interest. To test the effectiveness of the algorithm the authors applied it to detect animal hunt events in wildlife documentaries.

Violent events and car chases in feature films are detected in [86], based on analysis of environmental sounds. First, low-level sounds such as engines, horns, explosions, or gunfire are detected, which constitute part of the high-level sound events. Based on the dominance of those low-level sounds in a segment, it is labeled with a high-level named event.

Walking shots, gathering shots, and computer graphics shots in broadcast news are the named events detected in [56]. A walking shot is classified by detecting the characteristic repetitive up and down movement of the bottom of a facial region. When more than two similar sized facial regions are detected in a frame, a shot is classified as a gathering shot. Finally, computer graphics shots are classified by a total lack of motion in a series of frames.

The observation that authors use lighting techniques to intensify the drama of certain scenes in a video document is exploited in [140]. An algorithm is presented that detects flashlights, which is used as an identifier for dramatic events in feature films, based on features derived from the average frame luminance and the frame area influenced by the flashing light. Five types of dramatic events are identified that are related to the appearance of flashlights, i.e. supernatural power, crisis, terror, excitement, and generic events of great importance.

Whereas a flashlight can indicate a dramatic event in feature films, slow motion replays are likely to indicate semantically important events in sport video documents. In [97] a method is presented that localizes such events by detecting slow motion replays. The slow-motion segments are modelled and detected by an HMM.

One of the most important events in a sport video document is a score. In [8] a link between the visual and textual modalities is made to identify events that change the score in American football games. The authors investigate whether a chain of keywords, corresponding to an event, is found in the closed caption stream or not. In the time frames corresponding to those keywords, the visual stream is analyzed. Key frames of camera shots in the visual stream are compared with predefined templates using block matching based on the color distribution. Finally, the shot is indexed by the most likely score event, for example a touchdown.

Figure 6.3: Semantic index hierarchy with instances as found in literature. From top to bottom: instances of genre, sub-genre, logical units, and named events. The dashed box is used to group similar nodes. Note that this picture is taken from another paper; the numbers indicated as references are not correct.

Besides American football, methods for detecting events in tennis [84, 134, 167], soccer [16, 44], baseball [111, 167], and basketball [121, 169] are reported in literature. Commonly, the methods presented exploit domain knowledge and simple (visual) features related to color, edges, and camera/object motion to classify typical sport-specific events, e.g. smashes, corner kicks, and dunks, using a knowledge-based classifier. An exception to this common approach is [111], which presents an algorithm that identifies highlights in baseball video by analyzing the auditory modality only. Highlight events are identified by detecting excited speech of the commentators and the occurrence of a baseball pitch and hit.

Besides semantic indexing, detection of named events also forms a great resource for reuse of video documents. Specific information can be retrieved and reused in different contexts, or reused to automatically generate summaries of video documents. This seems especially interesting for, but is not limited to, video documents from the sport genre.

6.4.5 Discussion

Now that we have described the different semantic index techniques, as encountered in literature, we are able to distinguish the most prominent content and layout properties per genre. As variation in the textual modality is in general too diverse for differentiation of genres, and more suited to attach semantic meaning to logical units and named events, we focus here on properties derived from the visual and auditory modality only. Though a large number of genres can be distinguished, we limit ourselves to the ones mentioned in the semantic index hierarchy in figure 6.3, i.e. talk show, music, sport, feature film, cartoon, sitcom, soap, documentary, news, and commercial. For each of those genres we describe the characteristic properties.

The most prominent properties of the first genre, i.e. talk shows, are their well-defined structure, uniform setting, and prominent presence of dialogues, featuring mostly non-moving frontal faces talking close to the camera. Besides closing credits, there is in general a limited use of overlayed text.

Whereas talk shows have a well-defined structure and limited setting, music clips show great diversity in setting and mostly have an ill-defined structure. Moreover, music clips will have many short camera shots, showing lots of camera and object motion, separated by many gradual transition edits, and long microphone shots containing music. The use of overlayed text is mostly limited to information about the performing artist and the name of the song on a fixed position.

Sport broadcasts come in many different flavors, not only because there exists a tremendous number of sport sub-genres, but also because they can be broadcasted live or in summarized format. Despite this diversity, most authored sport broadcasts are characterized by a voice-over reporting on named events in the game, a watching crowd, a high frequency of long camera shots, and overlayed text showing game and player related information on a fixed frame position. Usually sport broadcasts contain a vast amount of camera motion, objects, and players within a limited uniform setting. Structure is sport-specific, but in general a distinction between different logical units can be made easily. Moreover, a typical property of sport broadcasts is the use of replays showing events of interest, commonly introduced and ended by a gradual transition edit.

Feature film, cartoon, sitcom, and soap share similar layout and content properties. They are all dominated by people (or toons) talking to each other or taking part in action scenes. They are structured by means of scenes. The setting is mostly limited to a small number of locales, sometimes separated by means of visual, e.g. gradual, or auditory, e.g. music, transition edits. Moreover, setting in cartoons is characterized by usage of saturated colors; also, the audio in cartoons is almost noise-free due to studio recording of speech and special effects. For all mentioned genres the usage of overlayed text is limited to opening and/or closing credits. Feature film, cartoon, sitcom, and soap differ with respect to people appearance, usage of special effects, presence of object and camera motion, and shot rhythm. Appearing people are usually filmed frontally in sitcoms and soaps, whereas in feature films and cartoons there is more diversity in appearance of people or toons. Special effects are most prominent in feature films and cartoons; laughter of an imaginary public is sometimes added to sitcoms. In sitcoms and soaps there is limited camera and object motion. In general cartoons also have limited camera motion, though object motion appears more frequently. In feature films both camera and object motion are present. With respect to shot rhythm it seems legitimate to state that this has stronger variation in feature films and cartoons. The perceived rhythm will be slowest for soaps, resulting in more frequent use of camera shots with relatively long duration.

Documentaries can also be characterized by their slow rhythm. Another property that is typical for this genre is the dominant presence of a voice-over narrating about the content in long microphone shots. Motion of camera and objects might be present in the documentary; the same holds for overlayed text. Mostly there is no well-defined structure. Special effects are seldom used in documentaries.

The most obvious property of news is its well-defined structure. Different news reports and interviews are alternated with anchor persons introducing, and narrating about, the different news topics. A news broadcast is commonly ended by a weather forecast. Those logical units are mostly dominated by monologues, e.g. people talking in front of a camera showing little motion. Overlayed text is frequently used on fixed positions for annotation of people, objects, setting, and named events. A report on an incident may contain camera and object motion. Similarity of studio setting is also a prominent property of news broadcasts, as is the abrupt nature of transitions between sensor shots.

Some prominent properties of the final genre, i.e. commercials, are similar to those of music. They have a great variety in setting and share no common structure, although they are authored carefully, as the message of the commercial has to be conveyed in twenty seconds or so. Frequent usage of abrupt and gradual transitions, in both the visual and auditory modality, is responsible for the fast rhythm. Usually lots of object and camera motion, in combination with special effects, such as a loud volume, are used to attract the attention of the viewer. Differences with music are that black frames are used to separate commercials, the presence of speech, the superfluous and non-fixed use of overlayed text, a disappearing station logo, and the fact that commercials usually end with a static frame showing the product or brand of interest.

Due to the large variety in broadcasting formats, which is a consequence of guidance by different authors, it is very difficult to give a general description of the structure and characterizing properties of the different genres. When considering sub-genres this will only become more difficult. Is a sports program showing highlights of today's sport matches a sub-genre of sport or news? Reducing the prominent properties of broadcasts to instances of layout and content elements, and splitting the broadcasts into logical units and named events, seems a necessary intermediate step to arrive at a more consistent definition of genre and sub-genre. More research on this topic is still necessary.

6.5 Conclusion

Viewing a video document from the perspective of its author enabled us to present a framework for multimodal video indexing. This framework formed the starting point for our review of different state-of-the-art video indexing techniques. Moreover, it allowed us to answer the three different questions that arise when assigning an index to a video document. The question what to index was answered by reviewing different techniques for layout reconstruction. We presented a discussion on reconstruction of content elements and integration methods to answer the how to index question. Finally, the which index question was answered by naming different present and future index types within the semantic index hierarchy of the proposed framework.

At the end of this review we stress that multimodal analysis is the future. However, more attention, in the form of research, needs to be given to the following factors:

1. Content segmentation

Content segmentation forms the basis of multimodal video analysis. In contrast to layout reconstruction, which is largely solved, there is still a lot to be gained in improved segmentation for the three content elements, i.e. people, objects, and setting. Contemporary detectors are well suited for detection and recognition of content elements within certain constraints. Most methods for detection of content elements still adhere to a unimodal approach. A multimodal approach might prove to be a fruitful extension, as it allows taking additional context into account. Bringing the semantic index to a higher level is the ultimate goal for multimodal analysis. This can be achieved by the integrated use of different robust content detectors or by choosing a constrained domain that ensures the best detection performance for a limited detector set.

2. Modality usage

Within the research field of multimodal video indexing, focus is still too much geared towards the visual and auditory modality. The semantically rich textual modality is largely ignored in combination with the visual or auditory modality. Specific content segmentation methods for the textual modality will have their reflection on the semantic index derived. Ultimately this will result in semantic descriptions that make a video document as accessible as a text document.

3. Multimodal integration

The integrated use of different information sources is an emerging trend in video indexing research. All reported integration methods indicate an improvement of performance. Most methods integrate in a symmetric and non-iterated fashion. Usage of incremental context by means of iteration can be a valuable addition to the success of the integration process. The most successful integration methods reported are based on the HMM and Bayesian network frameworks, which can be considered the current state-of-the-art in multimodal integration. There seems to be a positive correlation between usage of advanced integration methods and multimodal video indexing results. This paves the road for the exploration of classifier combinations from the field of statistical pattern recognition, or other disciplines, within the context of multimodal video indexing.

4. Technique taxonomy

We presented a semantic index hierarchy that grouped different index types as found in literature. Moreover, we characterized the different genres in terms of their most prominent layout and content elements, and by splitting their structure into logical units and named events. What the field of video indexing still lacks is a taxonomy of different techniques that indicates why a specific technique is best suited, or unsuited, for a specific group of semantic index types.

The impact of the above mentioned factors on automatic indexing of video documents will not only make the process more efficient and more effective than it is now, it will also yield richer semantic indexes. This will form the basis for a range of new innovative applications.

Keyterms in this chapter

Shot boundary detection, edit detection, layout reconstruction, people detection, object detection, setting detection, conversion, synchronization, multimodal integration, iterated/non-iterated integration, symmetric/a-symmetric integration, statistics or knowledge based integration, semantic video index, Logical Story Unit detection, overlapping links, visual distance, temporal distance, logical unit labelling, named event detection


Chapter 7

Data mining


Having stored all data and its classifications in the database we can explore the database for interesting patterns, non-obvious groupings, and the like. This is the field known as data mining. In [156] four different classes of data mining are distinguished:

• Classification learning : a learning scheme that takes a set of classified examples from which it is expected to learn a way of classifying unseen instances.

• Association learning : learning any association between features, not just ones that predict the class label.

• Clustering : grouping of instances that belong together as they share common characteristics.

• Numeric prediction: predict an outcome of an attribute of an instance, where the attribute is not discrete, but a numeric quantity.

7.1 Classification learning

This type of data mining we already encountered in chapter 5. Hence, this could in principle be done at data import time. It can still be advantageous to do it after data import as well. In the data-mining process, using any of the techniques described in the following sections might lead to new knowledge and new class definitions. Hence, new relations between the data and conceptual labels could be learned. Furthermore, if a user has used manual annotation to label the multimedia items in the database with concepts from an ontology, we can see whether automatic classification could be done based on the data. Finally, if the user would interact with the system in the data-mining process by labelling a limited number of multimedia items, we could learn automatic classification methods by learning from these examples. This is commonly known as active learning.

The output of classification learning could be a decision tree or a classification rule. Looking at the rules which are generated gives us a perspective on the data and helps in understanding the relations between data, attributes, and concepts.

1This chapter is mostly based on [156]; implementations of the techniques can be found at http://www.cs.waikato.ac.nz/~ml/weka/index.html



Figure 7.1: Architecture of the data-mining modules. (The diagram connects the data, with its type classification into nominal, ordinal, interval, and ratio, to machine learning and knowledge representation components producing clusters, association rules, regression trees, and instance-based representations as new knowledge.)

7.2 Association learning

Association learning has a strong connection to classification learning, but rather than predicting the class, any attribute or set of attributes can be predicted. As an example consider a set of analyzed broadcasts and viewing statistics indicating how many people watched the program. Some associations that might be learned would be:

if (commercial_starts) then (number-of-people-zapping = "high")
if (commercial_starts AND (previous-program = documentary)) then
    (number-of-people-zapping = "low")

When performing association learning we should be aware that we can always find an association between any two attributes. The target of course is to find the ones that are meaningful. To that end two measures have been defined to measure the validity of a rule:

• Coverage: the number of instances which are correctly predicted by the rule.

• Accuracy: the proportion of correctly predicted instances with respect to the total number of instances to which the rule applies.

Of course these two are conflicting. The smaller the number of instances the rule applies to, the easier it is to generate rules with sufficient accuracy. In its limit form, just write a separate rule for each instance, leading to an accuracy of 100%, but a coverage of 1 for each rule. Practical association rule learning methods first find a large set of rules with sufficient coverage and from there optimize the set of rules to improve the accuracy.
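The two measures are easy to compute once a rule is expressed as an antecedent and a consequent predicate; a sketch (the broadcast attributes are of course made up):

def rule_quality(instances, antecedent, consequent):
    """Coverage and accuracy of the rule 'if antecedent then consequent'.

    instances  : list of attribute dictionaries
    antecedent : predicate on an instance (the 'if' part)
    consequent : predicate on an instance (the predicted attribute value)"""
    applies = [x for x in instances if antecedent(x)]
    correct = [x for x in applies if consequent(x)]
    coverage = len(correct)
    accuracy = coverage / len(applies) if applies else 0.0
    return coverage, accuracy

broadcasts = [
    {'commercial_starts': True,  'previous_program': 'news',        'zapping': 'high'},
    {'commercial_starts': True,  'previous_program': 'documentary', 'zapping': 'low'},
    {'commercial_starts': False, 'previous_program': 'sports',      'zapping': 'low'},
]
print(rule_quality(broadcasts,
                   lambda x: x['commercial_starts'],
                   lambda x: x['zapping'] == 'high'))   # -> (1, 0.5)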


Figure 7.2: Illustration of the initial steps in clustering based on the Minimum Spanning Tree, shown in a two-dimensional feature space (Feat 1 vs. Feat 2). Note: two colors are used to indicate the notion of classes; however, at this point in the analysis no classes have been defined yet.

7.3 Clustering

We now turn our attention to the case where we want to find categories in a large set of patterns without having a supervisor labelling the elements in the different classes. Although again there are many different methods, we only consider two methods, namely k-means and Minimum Spanning Tree based clustering.

In k-means the system has to be initialized by indicating a set of k representative samples, called seeds, one for each of the expected categories. If such seeds cannot be selected, the user should at least indicate how many categories are expected; the system then automatically selects k seeds for initialization. Now every pattern is assigned to the nearest seed. This yields an initial grouping of the data. For each of the groups the mean is selected as the new seed for that particular cluster. Then the above process of assignment to the nearest seed and finding the mean is repeated till the clusters become stable.
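A minimal sketch of this procedure (seeds are chosen at random when the user supplies none; iteration stops when the assignment no longer changes):

import numpy as np

def k_means(patterns, k, seeds=None, max_iter=100):
    """patterns: (n, d) array; returns (cluster assignment per pattern, cluster means)."""
    patterns = np.asarray(patterns, dtype=float)
    rng = np.random.default_rng(0)
    if seeds is None:                        # no user-supplied seeds: pick k patterns at random
        seeds = patterns[rng.choice(len(patterns), size=k, replace=False)]
    means = np.array(seeds, dtype=float)
    assign = np.full(len(patterns), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(patterns[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)    # assign every pattern to the nearest seed/mean
        if np.array_equal(new_assign, assign):
            break                            # clusters are stable
        assign = new_assign
        for c in range(k):                   # the mean of each group becomes its new seed
            if np.any(assign == c):
                means[c] = patterns[assign == c].mean(axis=0)
    return assign, means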

The Minimum Spanning Tree approach considers the two patterns which are closest to each other and connects them. Then the process is repeated for all other pairs of points till all patterns are connected. This leads to a hierarchical description of the data elements. One can use any level in the tree to decompose the dataset into clusters. In figure 7.2 the process is illustrated.
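A sketch of the same idea via Kruskal-style merging, which is equivalent to cutting the longest edges of the minimum spanning tree: the closest still-unconnected pair is connected repeatedly until the desired number of clusters remains.

import numpy as np

def mst_clusters(patterns, n_clusters):
    """Single-linkage / MST clustering: merge the closest pair of components
    until only n_clusters remain; returns a cluster label per pattern."""
    patterns = np.asarray(patterns, dtype=float)
    n = len(patterns)
    parent = list(range(n))

    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # all pairwise edges, sorted by distance (the MST is built from the shortest ones)
    edges = sorted(
        (np.linalg.norm(patterns[i] - patterns[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    clusters = n
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj            # connect the two closest components
            clusters -= 1
            if clusters == n_clusters:
                break
    labels = [find(i) for i in range(n)]
    remap = {root: c for c, root in enumerate(sorted(set(labels)))}
    return [remap[root] for root in labels]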

7.4 Numeric prediction

Our final data mining algorithm considers numeric prediction. Whereas classification rules and association rules mostly deal with nominal or ordinal attributes, numeric prediction deals with numeric quantities. The most common rules come directly from statistics and are known there as linear regression methods. Let's assume you want to predict some numeric attribute x. In linear regression we assume that for an instance i the attribute x^i can be predicted by taking some linear combination of the values of some k other attributes a^i_1, ..., a^i_k. Thus,

    x^i = w_0 + w_1 a^i_1 + w_2 a^i_2 + ... + w_k a^i_k

where the w_j are weights for the different attributes. Now, given a sufficiently large dataset where the a^i_j and x^i are known, we have to find the w_j in such a way that for a new instance for which x^i is not known we can predict x^i by taking the weighted sum defined above.

Page 66: Multimedia Lecture Notes

66 CHAPTER 7. DATA MINING

If we had only k training instances, the above equation can in general be solved uniquely and every x^i for each of the k training samples would be estimated exactly. This is however not desirable, as a new instance would not be predicted correctly. So in practice one takes a set of n samples where n >> k and computes the weights w_j such that the prediction on the training set is as good as possible. For that purpose the mean squared error is used:

    sum_{i=1}^{n} ( x^i - sum_{j=0}^{k} w_j a^i_j )^2

where by convention a^i_0 = 1, so that w_0 acts as the intercept.

Although this might look complicated, standard algorithms for this are available and are in fact quite simple.
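Indeed, with a standard least-squares routine the weights follow in one call; a sketch, where a column of ones plays the role of a^i_0 = 1 for the intercept w_0:

import numpy as np

def fit_linear(A, x):
    """Least-squares weights w minimizing sum_i (x^i - sum_j w_j a^i_j)^2.

    A : (n, k) matrix of attribute values a^i_j
    x : (n,) vector of the numeric attribute to predict"""
    A1 = np.column_stack([np.ones(len(A)), A])        # prepend a^i_0 = 1 for the intercept w_0
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)
    return w

def predict(w, a_new):
    return w[0] + np.dot(w[1:], a_new)

# hypothetical example: x is roughly 3 + 2*a1 - a2
A = np.random.rand(50, 2)
x = 3 + 2 * A[:, 0] - A[:, 1] + 0.01 * np.random.randn(50)
print(fit_linear(A, x))                               # weights close to [3, 2, -1]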

Keyterms in this chapter

Clustering, association rules, expectation maximization, linear regression, Minimum Spanning Tree, k-means


Chapter 8

Information visualization

8.1 Introduction

When a lot of information is stored in the database it is important for a user to be able to visualize the information. Shneiderman1 makes this more precise and defines the purpose of information visualization as:

• Provide a compact graphical presentation AND user interface for manipulating large numbers of items (10^2 - 10^6), possibly extracted from far larger datasets.

• Enable users to make discoveries, decisions, or explanations about patterns (trend, cluster, gap, outlier, ...), groups of items, or individual items.

To do so a number of basic tasks are defined:

• Overview : Gain an overview of the entire collection

• Zoom: Zoom in on items of interest

• Filter : Filter out uninteresting items

• Details-on-demand : Select an item or group and get details when needed

• Relate: View relationships among items

And to support the overall process we need:

• History : Keep a history of actions to support undo, replay, and progressive refinement

• Extract : Allow extraction of sub-collections and of the query parameters

Now let us consider the visualization process in more detail. The visualization module gets as input a structured dataset (e.g. the result of user interaction, see chapter 9) or a database request, after which data and its structure (possibly obtained by data mining) is retrieved from the database.

The first step is to generate a view on the dataset. Steered by parameters from the user, a selection of the data is made for visualization purposes, catering for the five basic tasks overview, zoom, filter, details-on-demand, and relate.

When visualizing information we have to take into account the large set of possible datatypes that we can encounter. The first categorization is formed by the different standard types of information we considered in chapter 3:

1http://www.cs.umd.edu/hcil/pubs/presentations/INFORMSA3/INFORMSA3.ppt



• nominal, ordinal, interval, ratio

Next to these we have the multimedia types

• text, audio, image, video

For each of these items we will also have computed features and the similarity between different multimedia items. To be able to understand and explore the datasets we need to visualize the data as well as the features and the similarity in such a way that the user gets sufficient insight to make decisions.

The data may have an inherent structure, e.g. when the corresponding ontology is organized as a hierarchical tree, or when the different items have been grouped into different clusters. For structure the following three basic types are defined:

• Temporal structures: data with a temporal ordering, e.g. elements related to a timeline.

• Tree structures: data with a hierarchical structure.

• Network structures: data with a graph-like structure, e.g. a network where the nodes are formed by the multimedia items and where the weighted edges are labelled with the similarity between the items.

Each of the datatypes has an intrinsic data dimension d. A set of nominal data elements has d equal to 0, as elements are not ordered. For feature vectors of size n describing multimedia items, we have d equal to n. A set of ordinal elements has d equal to 1.

There are different ways to visualize any of the above types, several of which can be found in the literature. The choice for a specific type of visualization depends on the characteristics of the data, in particular d, and on user preferences. Each type of visualization takes some datatype as input and generates a display space with perceived dimension d*. E.g. a row of elements has d* = 1, a wireframe of a cube in which data elements are placed has d* = 3. If the appearance of the cube is changing over time the perceived dimension can go up to 4. Thus, for d ≤ 4 it is in principle possible to have d* = d. For higher dimensional data it is necessary to employ a projection operator to reduce the number of dimensions to a maximum of 4. For a space of feature vectors, typically principal component analysis is used to limit the number of dimensions to 2 or 3.
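As a minimal sketch of such a projection (assuming the multimedia items are described by n-dimensional feature vectors; the function and variable names are illustrative):

import numpy as np

def pca_project(features, out_dim=2):
    # Center the feature vectors and keep the top principal components via SVD,
    # reducing the data to a display space with perceived dimension out_dim.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T              # display-space coordinates

items = np.random.rand(500, 64)                   # e.g. 500 items, 64 features each
coords = pca_project(items)                       # 500 x 2 coordinates for display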

Clearly the perceived dimension is not equal to the screen dimension, which for regular displays is always 2. Hence for every visualization type a mapping has to be made to a 2D image or graph, possibly with an animation-like use of time. How the mapping has to be done depends on the device characteristics, most importantly screen size. Finally, all visualizations have to be combined and presented to the user, who will then navigate and issue new commands to change the view, and so on.

An overview of the visualization process is presented in figure 8.1.

8.2 Visualizing video content


Having considered the visualization of the whole information space we now consider the visualization of the multimedia items themselves, in particular video. Visualization of video content depends on the structural level of the unit. Visualization of frame content is trivial. When aiming to visualize structure we should realize

2This is adapted from [146]


Figure 8.1: Overview of the visualization process.

that an LSU has a tree structure and hence the above-mentioned methods for tree visualization can be used. What remains, and what is specific for video, is the visualization of the shot.

The pictorial content of a shot can be visualized in various ways. Showing the video as a stream is trivial. Stream visualization is used only when the user is very certain the shot content is useful. The advantage of stream visualization is that it shows every possible detail in the video content. Also, it is easy to synchronize with the other modalities, especially audio. The obvious disadvantage is the time needed to evaluate the shot content. Also, the viewer is forced to watch shots one at a time. A user is able to view several images in parallel, but for shots with moving content this is confusing.

Alternatively, a static visual video summary could be displayed, allowing the viewer to evaluate large video sequences quickly. The most important summary visualizations are key frames, mosaicking, and slice summaries.

Key frame visualization is the most popular visualization of shot content. For each shot one or more key frames are selected, representing in some optimal way the content of the shot. Key frames are especially suited for visualizing static video shots, such as in dialog scenes, because the spatiotemporal aspect hardly plays a role then. In such cases, viewing one frame reveals the content of the entire shot.
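The notes do not prescribe a particular selection criterion; as a minimal sketch of one common, simple strategy (assuming the shot is available as a list of decoded frames), a key frame can be chosen as the frame whose color histogram is closest to the average histogram of the shot:

import numpy as np

def select_key_frame(frames, bins=16):
    # frames: list of H x W x 3 uint8 arrays belonging to one shot.
    hists = np.stack([
        np.concatenate([np.histogram(f[..., c], bins=bins, range=(0, 255),
                                     density=True)[0] for c in range(3)])
        for f in frames])
    mean_hist = hists.mean(axis=0)
    distances = np.linalg.norm(hists - mean_hist, axis=1)
    return int(np.argmin(distances))              # index of the most representative frame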

Mosaicking combines the pictorial content of several (parts of) frames into one image. Salient stills3 [139] are a noteworthy instance of mosaicking in which the output image has the same size as the input frames. The advantage of mosaicking is the visualization beyond frame borders and the visualization of events by showing the same object several times at different positions. The disadvantage is its dependence on uniform movement of camera or objects, making mosaicking only applicable in very specific situations. A mosaic is shown in figure 8.2.

Slice summaries [147] are used for giving a global impression of shot content. They are based on just one line of pixels of every frame, meaning the value for the vertical dimension is fixed in the case of a horizontal slice, which consequently has the size width*length of the sequence. A vertical slice has size height*length. The fixed values are usually the center of the frame. Video slices were originally used for detecting cuts and camera movement [154], but they are now used for visualization as well, such as in the OM-representation [87]. In special cases, a

3See for examples http://www.salientstills.com/images/images other.html


Figure 8.2: An example of a mosaic.

Figure 8.3: An example of a vertical slice.

slice summary gives a very exact view on the visual content in a shot, so that one understands immediately what is happening in the video. This does, however, require the object to move slowly relative to the camera in a direction perpendicular to the slice direction. The disadvantage of slice summaries is the assumption that relevant content is positioned in the slice areas. Although generally the region of interest is near the center of a frame, it is possible a slice summary misses the action in the corners of a frame.

Even better is to use a dual slice summary that simultaneously shows both a horizontal and a vertical video slice. An example of a vertical slice can be seen in figure 8.3.

The use of two directions increases the chance of showing a visually meaningful summary. The vertical slice provides most information about the video content since in practice it is more likely that the camera or object moves from left to right than from bottom to top. Slices in both directions provide supporting information about shot boundaries and global similarities in content over time. Thus, the viewer can evaluate the relationship between shots quickly, e.g. in a dialog. Recurrent visual content can be spotted easily in a large number of frames, typically 600 frames on a 15 inch screen.
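As a minimal sketch of building both slice summaries (assuming OpenCV is available and "shot.mp4" is a hypothetical input file):

import cv2
import numpy as np

cap = cv2.VideoCapture("shot.mp4")
rows, cols = [], []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    rows.append(frame[h // 2, :, :])              # central horizontal line of pixels
    cols.append(frame[:, w // 2, :])              # central vertical line of pixels
cap.release()

horizontal_slice = np.stack(rows, axis=0)         # size: length x width
vertical_slice = np.stack(cols, axis=1)           # size: height x length
cv2.imwrite("horizontal_slice.png", horizontal_slice)
cv2.imwrite("vertical_slice.png", vertical_slice)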

Finally, we can also create a video abstract, which is a short video consisting of parts from the original video. This is very much what you see when you view a trailer of a movie.

Keyterms in this chapter

Overview, filter, details, zoom, relate, tree structure, network structure, temporal structure, intrinsic data dimension, perceived dimension, key frames, mosaic, slice summary, video abstract.


Chapter 9

Interactive Content Based Multimedia Retrieval


9.1 Introduction

Finding information in a multimedia archive is a notoriously difficult task due to the semantic gap. Therefore, the user and the system should support each other in finding the required information. Interaction of users with a data set has been studied most thoroughly in categorical information retrieval [98]. The techniques reported there need rethinking when used for image retrieval, as the meaning of an image, due to the semantic gap, can only be defined in context. Multimedia retrieval requires active participation of the user to a much higher degree than required by categorized querying. In content-based image retrieval, interaction is a complex interplay between the user, the images, and their semantic interpretations.

9.2 User goals

Before defining software tools for supporting the interactive retrieval of multimedia items we first consider the different goals a user can have when accessing a multimedia archive.

Assume a user is accessing a multimedia archive I in which a large collection of multimedia items is stored. We distinguish the following five different goals.

• answer search: finding an answer to a specific question by locating the required evidence in the multimedia item set.

Finding an answer in a dataset is a well-known problem in text retrieval (see e.g. the Question Answering track of the TREC conference [152]). Now consider an example question like "How many spikes are there on the Statue of Liberty?". If the system had to answer this question based on an image archive, a set of pictures should be located in which the Statue of Liberty is shown from different sides. From there, in each image, the statue has to be segmented from the background, and from there the spikes have to be found. Clearly this is far beyond the current state of the art in image databases. Therefore, we will not consider this goal here.

1This chapter is adapted from [158].

The first of the three goals we will consider is:


• target search: finding a specific multimedia item in the database.

This is a task in which the assumption is that the user has a clear idea of what he is looking for, e.g. he wants to find the picture of the Annapurna in moonlight which he took on one of his holidays. The generic required answer has the following form

A^t = m

with m ∈ I. The next search goal is:

• category search: finding a set of images from the same category.

As categories can be defined in various ways, this search goal covers a lot of different subgoals. The multimedia items can for example have a similar style, e.g. a set of horror movies, contain similar objects, e.g. pictures of cars in a catalogue, or be recorded in the same setting, like different bands at an outdoor concert. Its generic form is:

A^c = {m_i}_{i=1,...,n_i}

with m_i ∈ I and n_i the number of elements in the category searched for. The final search goal that will be considered in this chapter is:

• associative search: a search with no other goal than interesting findings.

This is the typical browsing users do when surfing on the web or other heterogeneous datasets. Users are guided by their associations and move freely from a multimedia item in one category to another. It can be viewed as multi-category search. Users are not looking for specific multimedia items, thus whenever they find an interesting item any other element from the same category would be of interest also. So the final generic form to consider is:

A^a = {{m_{ij}}_{j=1,...,n_i}}_{i=1,...,n_c}

with m_{ij} ∈ I, n_c the number of categories the user is interested in, and n_i the number of elements in category i.

In addition to the above search tasks, which all try to find a multimedia item, we can also consider

• object search: finding a subpart of the content of a multimedia item.

This goal addresses elements within the content of a multimedia item, e.g. a user is looking for a specific product appearing in a sitcom on television. In fact, when dealing with objects, one can again consider answer, target, and category search. Conceptually the difference is not that large, but it can have a significant impact on the design of methods supporting the different tasks. In this chapter we focus on searching for multimedia items.

The goals mentioned are not independent. In fact they can be special cases of one another. If a user doing an associative search finds only one category of multimedia items of interest it reduces to category search. In turn, if this category consists of a single element in the dataset, it becomes target search. Finally, if the target multimedia item contains one object only which can be clearly distinguished from its background, it becomes object search. This is illustrated in figure 9.1. In the general case, the different goals have a great impact on the system supporting the retrieval.


Figure 9.1: Illustration of the relations between the different search goals.

9.3 Query Space: definition

There are a variety of methods in the literature for interactive retrieval of multimedia items. Here we focus in particular on interactive image retrieval, but indications will be given how to generalize to other modalities.

9.3.1 Query Space: data items

To structure the description of methods we first define query space. The first component of query space is the selection of images IQ from the large image archive I. Typically, the choice is based on factual descriptions like the name of the archive, the owner, date of creation, or web-site address. Any standard retrieval technique can be used for the selection.

9.3.2 Query Space: features

The second component is a selection of the features FQ ⊂ F derived from the images in IQ. In practice, the user is not always capable of selecting the features most fit to reach the goal. For example, how should a general user decide between image description using the HSV or Lab color space? Under all circumstances, however, the user should be capable to indicate the class of features relevant for the task, like shape, texture, or both. In addition to feature class, [42] has the user indicate the requested invariance. The user can, for example, specify an interest in features robust against varying viewpoint, while the expected illumination is specified as white light in all cases. The appropriate features can then automatically be selected by the system.


Figure 9.2: Illustration of query space showing a set I consisting of 6 images only, for which two features are computed, and two labels are present for each image.

9.3.3 Query Space: similarity

As concerns the third component of query space, the user should also select a similarity function SQ. To adapt to different data-sets and goals, SQ should be a parameterized function. Commonly, the parameters are weights for the different features.

9.3.4 Query Space: interpretations

The fourth component of query space is a set of labels ZQ ⊂ Z to capture goal-dependent semantics. Given the above, we define an abstract query space:

The query space Q is the goal-dependent 4-tuple {IQ, FQ, SQ, ZQ}.

The concept of query space is illustrated in figure 9.2.
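As a minimal sketch of this 4-tuple as a data structure (illustrative only; the field names and types are assumptions, not part of the definition above):

from dataclasses import dataclass
from typing import Callable, Dict, List
import numpy as np

@dataclass
class QuerySpace:
    images: List[str]                                        # IQ: selected image ids
    features: Dict[str, np.ndarray]                          # FQ: feature vector per image
    similarity: Callable[[np.ndarray, np.ndarray], float]    # SQ: parameterized similarity
    labels: Dict[str, Dict[str, float]]                      # ZQ: probability P_i(z) per image i and label z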

9.4 Query Space: processing steps

When a user is interactively accessing a multimedia dataset the system performs a set of five processing steps, namely initialization, specification, visualization, feedback, and output. Of these five steps the visualization and feedback steps form the iterative part of the retrieval task. In each of the steps every component of the query space can be modified. The remaining three steps are relevant for any retrieval system. These processing steps are illustrated in figure 9.3.

9.4.1 Initialization

To start a query session, an instantiation Q = {IQ, FQ, SQ, ZQ} of the abstract query space is created. When no knowledge about past or anticipated use of the system is available, the initial query space Q0 should not be biased towards specific images, or make some image pairs a priori more similar than others. The active set of images IQ is therefore equal to all of IQ. Furthermore, the features of FQ are


Figure 9.3: The processing steps in interactive content based retrieval.

normalized based on the distribution of the feature values over IQ, e.g. [39, 112]. To make SQ unbiased over FQ, the parameters should be tuned, arriving at a natural distance measure. Such a measure can be obtained by normalization of the similarity between individual features to a fixed range [149, 112]. For the instantiation of a semantic label, the semantic gap prevents attachment to an image with full certainty. Therefore, in the ideal case, the instantiation ZQ of ZQ assigns for each i ∈ IQ and each z ∈ ZQ a probability Pi(z) rather than a strict label. Without prior information no labels are attached to any of the multimedia items. However, one can have an automatic system which indexes the data, for example the ones discussed in [131] for video indexing. As this is the result of an automatic process, P captures the uncertainty of the result. If a user has manually annotated a multimedia item i with label z, this can be considered as ground truth data and Pi(z) = 1.
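As a minimal sketch of such an unbiased initialization (assuming the features are stored in a matrix with one row per image; z-score normalization is one common choice for bringing feature distributions to a comparable range, not the only one):

import numpy as np

def normalize_features(F, eps=1e-9):
    # Normalize every feature dimension over the active image set so that no
    # single feature dominates the initial (natural) distance measure.
    return (F - F.mean(axis=0)) / (F.std(axis=0) + eps)

def natural_distance(f1, f2):
    # Unweighted distance on normalized features; weights are only adapted
    # later, from relevance feedback.
    return float(np.linalg.norm(f1 - f2))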

The above is of course not the ideal situation. While systems for interactive retrieval are currently only starting to appear, they can be expected to be commonplace in the years to come. Thus, there might already be a large amount of previous use of the system. The results of these earlier sessions can be stored in the form of one or more profiles. Any such profile can be considered as a specification of how the query space should be initialized.

The query space forms the basis for specifying queries, display of query results, and for interaction.

9.4.2 Query specification

For specifying a query q in Q, many different interaction methodologies have been proposed. A query falls in one of two major categories:

• exact query, where the query answer set A(q) equals the images in IQ satisfying a set of given criteria.

• approximate query, where A(q) is a ranking of the images in IQ with respect to the query.

Figure 9.4: Example queries for each of the 6 different query types (exact and approximate versions of spatial, image, and group queries) and possible results from the Corel image database.

Within each of the two categories, three subclasses can be defined depending on whether the query relates to the spatial content of the image, to the global image information, or to groups of images. An overview of specification of queries in image retrieval is shown in figure 9.4. Figure 9.5 and figure 9.6 show the equivalent classes for video and text retrieval respectively.

Exact queries are expressed in terms of the standard boolean operators and, or, and not. What makes them special is the use of specific image-based predicates. There are three different exact query classes:

• Exact query by spatial predicate is based on the location of silhouettes, i.e. the result of strong segmentation, homogeneous regions resulting from weak segmentation, or signs2. Query on silhouette location is applicable in narrow domains only. Typically, the user queries using an interpretation z ∈ ZQ. To answer the query, the system then selects an appropriate algorithm for segmenting the image and extracting the domain-dependent features. In [126] the user interactively indicates semantically salient regions to provide a starting point. The user also provides sufficient context to derive a measure for the probability of z. Implicit spatial relations between regions sketched by the user in [130] yield a pictorial predicate. Other systems let the user explicitly define the predicate on relations between homogeneous regions [21]. In both cases, to be added to the query result, the homogeneous regions as extracted from the image must comply with the predicate. A web search system in which the user places icons representing categories like human, sky, and water in the requested spatial order is presented in [72]. In [117], users pose spatial-predicate queries on geographical signs located in maps based on their absolute or relative positions.

2A sign is an object in the image with fixed shape and orientation e.g. a logo.


Figure 9.5: idem for video.

Figure 9.6: idem for text.


• Exact query by image predicate is a specification of predicates on global image descriptions, often in the form of range predicates. Due to the semantic gap, range predicates on features are seldom used in a direct way. In [96], ranges on color values are pre-defined in predicates like "MostlyBlue" and "SomeYellow". Learning from user annotations of a partitioning of the image allows for feature range queries like: "amount of sky >50% and amount of sand >30%" [107].

• Exact query by group predicate is a query using an element z ∈ ZQ where ZQ is a set of categories that partitions IQ. Both in [23] and [143] the user queries on a hierarchical taxonomy of categories. The difference is that the categories are based on contextual information in [23] while they are interpretations of the content in [143].

In the approximate types of query specifications the user specifies a single feature vector or one particular spatial configuration in FQ, where it is anticipated that no image will satisfy the query exactly.

• Approximate query by spatial example results in an image or spatial structure corresponding to literal image values and their spatial relationships. Pictorial specification of a spatial example requires a feature space such that feature values can be selected or sketched by the user. Low-level feature selectors use color pickers, or selections from shape and texture examples [39, 46]. Kato [64] was the first to let users create a sketch of the global image composition which was then matched to the edges in IQ. Sketched outlines of objects in [70] are first normalized to remove irrelevant detail from the query object, before matching it to objects segmented from the image. When specification is by parameterized template [32, 123], each image in IQ is processed to find the best match with edges of the images. The segmentation result is improved if the user may annotate the template with salient details like color corners and specific textures. Pre-identification of all salient details in images in IQ can then be employed to speed up the search process [127]. When weak segmentation of the query image and all images in IQ is performed, the user can specify the query by indicating example regions [21, 130].

• Approximate query by image example feeds the system a complete array of pixels and queries for the most similar images, in effect asking for the k-nearest neighbors in feature space. Most of the current systems have relied upon this form of querying [39, 46]. The general approach is to use an SQ based on global image features. Query by example queries are subclassified [149] into query by external image example, if the query image is not in the database, versus query by internal image example. The difference between external and internal example is minor for the user, but affects the computational support as for internal examples all relations between images can be pre-computed. Query by image example is suited for applications where the target is an image of the same object or set of objects under different viewing conditions [42]. In other cases, the use of one image cannot provide sufficient context for the query to select one of its many interpretations [118].

• Approximate image query by group example is specification through a selection of images whose ensemble defines the goal. The rationale is to put the image in its proper semantic context to make one of the possible interpretations z ∈ ZQ clearly stand out with respect to other possible interpretations. One option is that the user selects m > 1 images from a palette of images presented, to find images best matching the common characteristics of the m images [29].
An m-query set is capable of defining the target more precisely. At the same time, the m-query set defines relevant feature value variations and nullifies irrelevant variations in the query. Group properties are amplified further by adding negative examples. This is achieved in [7] by constructing a query q best describing positive and negative examples indicated by the user. When for each group in the database a small set of representative images can be found, they are stored in a visual dictionary from which the user creates the query [118].

Of course, the above queries can always be combined into more complex queries. For example, both [21, 130] compare the similarity of regions using features. In addition, they encode spatial relations between the regions in predicates. Even with such complex queries, a single query q is rarely sufficient to make A(q) the user-desired answer set. For most image queries, the user must engage in active interaction with the system on the basis of the query results as displayed.

9.4.3 Query output

In most current retrieval systems the output is an unordered set of multimedia items for exact queries and a ranked set of images for approximate search. However, when considering the different user search goals identified in section 9.2, the output should depend on the user goal. As the system should base its result on query space, we let the output or answer operator A work on query space rather than on the query as used in section 9.4.2. To differentiate between the different goals we again use the superscript t for target, c for category, and a for associative. This leads to the following operators, where we assume that the number of answers is given by n:

For target search the operator is very simple as only one element is requested, thus the most likely one is selected.

A^t(Q) = arg max_{m ∈ I_Q} P(m = target)

Category search yields a set of multimedia items, which should all be in the same category. To generate the answer the user should specify the label z corresponding to the category and a threshold T_c on the probability measure for the label on a multimedia item:

A^c(Q, z, T_c) = {m ∈ I_Q | P_m(z) ≥ T_c}

Finally, the associative search yields a set of results for a vector of labels (z_1, ..., z_n) corresponding to interesting categories:

A^a(Q, (z_1, ..., z_n), T_c) = (A^c(Q, z_i, T_c))_{i=1,...,n}
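As a minimal sketch of these output operators (assuming the target probabilities P(m = target) and the label probabilities P_m(z) are stored in plain Python dictionaries; the names are illustrative):

def target_output(target_prob):
    # A^t: return the single most likely item.
    return max(target_prob, key=target_prob.get)

def category_output(label_prob, z, t_c):
    # A^c: return all items whose probability for label z reaches threshold t_c.
    return {m for m, probs in label_prob.items() if probs.get(z, 0.0) >= t_c}

def associative_output(label_prob, labels, t_c):
    # A^a: one category answer per label of interest.
    return [category_output(label_prob, z, t_c) for z in labels]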

9.4.4 Query output evaluation

Having arrived at a final output after the search quest of the user, we can evaluate the search performance of the system. In information retrieval there are two basic measures, namely precision and recall. The first measure tells us how much of the results returned are indeed correct. The second measure quantifies how many of the relevant items in the database we actually found. In practice, the two measures are competing: if the parameters of the system are tuned to favor precision, the recall will go down and vice versa.

To make this more precise, let A be the user goal and let Â denote the result of the system. Precision p and recall r are then given as:


p = |A ∩ Â| / |Â|

r = |A ∩ Â| / |A|

where |·| denotes the number of elements in a set.

Let us now see how we can apply these measures for the evaluation of the three search goals, namely target, category, and associative search. For target search it is very easy to evaluate the result, as either the target is found or not; we can simply fill in the definitions given earlier for formalizing the search and the output:

p^t(Q) = |A^t(Q) ∩ Â^t(Q)| / |Â^t(Q)|

r^t(Q) = |A^t(Q) ∩ Â^t(Q)| / |A^t(Q)|

Clearly, for target search this value is either 1 (target found) or 0 (target not found). For category search precision and recall can also be used directly:

p^c(Q, z, T_c) = |A^c(Q, z, T_c) ∩ Â^c(Q, z, T_c)| / |Â^c(Q, z, T_c)|

r^c(Q, z, T_c) = |A^c(Q, z, T_c) ∩ Â^c(Q, z, T_c)| / |A^c(Q, z, T_c)|

Evaluating associative search is somewhat more difficult. As we have interpreted this task as multi-category search we have to consider precision and recall for all different categories. Thus we get:

p^a(Q, (z_1, ..., z_n), T_c) = (p^c(Q, z_i, T_c))_{i=1,...,n}

r^a(Q, (z_1, ..., z_n), T_c) = (r^c(Q, z_i, T_c))_{i=1,...,n}

For object search two types of evaluation are needed. First, we can use the evaluation measures given for category search to evaluate whether the right set of multimedia items is found. Given that a multimedia item contains at least one instance of the object searched for, we should then evaluate how well the object is localized. This can in fact be done with two measures which are closely related to precision and recall, measuring the overlap between the objects found and the desired object.
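As a minimal sketch (assuming the user goal and the system result are available as Python sets of item identifiers):

def precision_recall(goal, result):
    # goal: the set of relevant items A; result: the set of returned items.
    if not goal or not result:
        return 0.0, 0.0
    hits = len(goal & result)
    return hits / len(result), hits / len(goal)   # (precision, recall)

p, r = precision_recall({"m1", "m2", "m3"}, {"m2", "m3", "m7", "m9"})
print(p, r)                                       # 0.5 0.666...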

9.5 Query space interaction

In early systems, the process of query specification and output is iterated, where in each step the user revises the query. Updating the query is often still appropriate for exact queries. For approximate queries, however, the interactive session should be considered in its entirety. During the session the system updates the query space, attempting to learn the goals from the user's feedback. This leads to the framework depicted in figure 9.3, including all 5 processing steps mentioned earlier.

We define:


An interactive query session is a sequence of query spaces {Q0, Q1, ..., Qn−1, Qn} where the interaction of the user yields a relevance feedback RFi in every iteration i of the session.

In this process, the transition from Qi to Qi+1 materializes the feedback of the user. In a truly successful session the user had a search goal based on the initial query space he/she started with, and the output on the last query space, A(Qn), does indeed give this goal as an answer.

In the following subsections we will first consider what the optimal change in each step of the interaction process is. Then we consider how the query space can be displayed, as this is the means through which the user can interact. We subsequently consider how the user can give feedback, and finally we explain how the query space can be updated based on the user's feedback and how this whole process should be evaluated.

9.5.1 Interaction goals

In the process of interaction the query space should get closer and closer to the result requested by the user. We now consider what this means for the query space update in every step, where we make a distinction between the different search goals.

When doing an object search the set I should be reduced in every step in such a way that no multimedia items containing the object are removed from the query space. The features F should be such that within the items containing the object, the object is optimally discriminated from the background. Similarity S should put items containing the object closer to each other, while putting them further away from the other items. Finally, items containing the object should have a high probability in Z while others get a lower probability.

For target search we encounter similar things. I should ideally be reduced to the single target item. The features F should be such that they highlight important information in the target item. S should make the target stand out clearly from the other items. When the label target is used, clearly the probability of the target should go to 1, while the probability for other items goes to 0.

Category search is not much different from target search and hence the goals are very similar, now focussing on a category of items. As browsing can be viewed as multi-category search it yields a similar analysis; however, as the desired categories might in fact be very different from one another we have to allow for multiple possible query spaces, which at output time should be combined.

9.5.2 Query space display

There are several ways to display the query space to the user (see chapter 8). In addition, system feedback can be given to help the user in understanding the result. Recall from chapter 8 that the perceived dimension d* of the display should ideally equal the intrinsic dimension d of the data. Hence, for our purposes we have to consider the intrinsic dimension of the query space or, as the initial query space is the result of a query, the intrinsic dimension of the query result.

When the query is exact, the result of the query is a set of images fulfilling the predicate. As an image either fulfills the predicate or not, there is no intrinsic order in the query result and hence d = 0.

For approximate queries, the images in IQ are given a similarity ranking based on SQ with respect to the query. In many systems the role of V is limited to bounding the number of images displayed, which are then displayed in a 2D rectangular grid [39, 23]. Note, however, that we should have d = 1. If the user refines his query using query by example, the images displayed do not have to be the images closest to the query. In [149], images are selected that together provide a representative overview of the whole active database. An alternative display model displays the image set minimizing the expected number of total iterations [29].

The space spanned by the features in FQ is a high-dimensional space. When images are described by feature vectors, every image has an associated position in this space. In [118, 138, 54] the operator V maps the high-dimensional feature space onto a display space with d* = 3. Images are placed in such a way that distances between images in D reflect SQ. A simulated fisheye lens is used to induce perception of depth in [138]. In the reference, the set of images to display depends on how well the user selections conform to selections made in the community of users. To improve the user's comprehension of the information space, [54] provides the user with a dynamic view on FQ through continuous variation of the active feature set. The display in [63] combines exact and approximate query results. First, the images in IQ are organized in 2D-layers according to labels in ZQ. Then, in each layer, images are positioned based on SQ.

In exact queries based on accumulative features, back-projection [136, 36] can be used to give system feedback, indicating which parts of the image fulfill the criteria. For example, in [107] each tile in the partition of the image shows the semantic label, like sky, building, or grass, the tile received. For approximate queries, in addition to mere rank ordering, in [21] system feedback is given by highlighting the subparts of the images contributing most to the ranking result.

9.5.3 User feedback

For target search, category search, or associative search various ways of user feedback have been considered. All are balancing between obtaining as much information from the user as possible and keeping the burden on the user minimal. The simplest form of feedback is to indicate which images are relevant [29]. In [26, 81], the user in addition explicitly indicates non-relevant images. The system in [112] considers five levels of significance, which gives more information to the system, but makes the process more difficult for the user. When d ≥ 2, the user can manipulate the projected distances between images, putting away non-relevant images and bringing relevant images closer to each other [118]. The user can also explicitly bring in semantic information by annotating individual images, groups of images, or regions inside images [83] with a semantic label. In general, user feedback leads to an update of query space:

{I_Q^i, F_Q^i, S_Q^i, Z_Q^i} −RF_i→ {I_Q^{i+1}, F_Q^{i+1}, S_Q^{i+1}, Z_Q^{i+1}}    (9.1)

Different ways of updating Q are open. In [149] the images displayed correspond to a partitioning of IQ. By selecting an image, one of the sets in the partition is selected and the set IQ is reduced. Thus the user zooms in on a target or a category.

These methods follow the pattern:

I_Q^i −RF_i→ I_Q^{i+1}    (9.2)

In current systems, the feature vectors in FQ, corresponding to images in IQ, are assumed fixed. This has great advantages in terms of efficiency. When features are parameterized, however, feedback from the user could lead to optimization of the parameters. For example, in parameterized detection of objects based on salient contour details, the user can manipulate the segmentation result to have the system select a more appropriate salient detail based on the image evidence [127]. The general pattern is:

F_Q^i −RF_i→ F_Q^{i+1}    (9.3)


For associative search users typically interact to teach the system the right associations. Hence, the system updates the similarity function:

S_Q^i −RF_i→ S_Q^{i+1}    (9.4)

In [26, 112] SQ is parameterized by a weight vector on the distances between individual features. The weights in [26] are updated by comparing the variance of a feature in the set of positive examples to the variance in the union of positive and negative examples. If the variance for the positive examples is significantly smaller, it is likely that the feature is important to the user. The system in [112] first updates the weight of different feature classes. The ranking of images according to the overall similarity function is compared to the rankings corresponding to each individual feature class. Both positive and negative examples are used; the weight of an individual feature is computed as the inverse of the variance over the positive examples. The feedback RFi in [118] leads to an update of the user-desired distances between pairs of images in IQ. The parameters of the continuous similarity function should be updated to match the new distances. A regularization term is introduced limiting the deviation from the initial natural distance function. This ensures a balance between the information the data provides and the impact of the feedback of the user. With such a term the system can "smoothly" learn the feedback from the user.
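As a much-simplified sketch of this kind of feature re-weighting (in the spirit of the variance-based schemes above, not an exact reimplementation of [26] or [112]): features that vary little over the positively judged images receive a high weight.

import numpy as np

def update_feature_weights(positive_features, eps=1e-6):
    # positive_features: n_pos x n_features array with the feature values of the
    # images the user marked as relevant in this iteration.
    weights = 1.0 / (positive_features.var(axis=0) + eps)   # inverse variance
    return weights / weights.sum()                          # normalize to sum 1

def weighted_distance(f1, f2, weights):
    return float(np.sqrt(np.sum(weights * (f1 - f2) ** 2)))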

The final set of methods follows the pattern:

Z_Q^i −RF_i→ Z_Q^{i+1}    (9.5)

The system in [83] pre-computes a hierarchical grouping of partitionings (or images for that matter) based on the similarity for each individual feature. The feedback from the user is employed to create compound groupings corresponding to a user-given z ∈ ZQ. The compound groupings are such that they include all of the positive and none of the negative examples. Unlabelled images in the compound group receive label z. The update of probabilities P is based on different partitionings of IQ. For category and target search, a system may refine the likelihood of a particular interpretation, updating the label based on feature values or on similarity values. The method in [81] falls in this class. It considers category search, where ZQ is {relevant, non-relevant}. In the limit case of only one relevant image, the method boils down to target search. All images indicated by the user as relevant or non-relevant in current or previous iterations are collected and a Parzen estimator is constructed incrementally to find the optimal separation between the two classes. The generic pattern which uses similarity in updating probabilities is the form used in [29] for target search with ZQ = {target}. In the reference an elaborate Bayesian framework is derived to compute the likelihood of any image in the database to be the target, given the history of actions RFi. In each iteration, the user selects examples from the set of images displayed. Image pairs are formed by taking one selected and one displayed, but non-selected, image. The probability of being the target for an image in IQ is increased or decreased depending on the similarity to the selected and the non-selected example in the pair.
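As a much-simplified sketch of such a probability update for target search (loosely inspired by the pairwise scheme above, not the actual Bayesian framework of [29]): items that are more similar to the selected image than to the non-selected image become more likely targets.

import numpy as np

def update_target_probabilities(p, sim, selected, non_selected, alpha=1.2):
    # p: current probability per item of being the target (sums to 1)
    # sim: precomputed item-item similarity matrix
    closer_to_selected = sim[:, selected] > sim[:, non_selected]
    p = np.where(closer_to_selected, p * alpha, p / alpha)
    return p / p.sum()                            # renormalize to a distribution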

9.5.4 Evaluation of the interaction process

In section 9.4.4 we evaluated the output of the search task. When the user is interacting with the system this is not sufficient. Given some output performance it makes a great difference whether the user needed one or two iteration steps to reach this result, or was interacting over and over to get there.

So one evaluation measure to consider is:


• Iteration count : the number of iterations needed to reach the result.

However, this is not sufficient: we can design very simple iteration strategies where the user selects one multimedia item to continue, or a very elaborate one where the user has to indicate the relevance of each and every item on display.

As it is hard to compare the different feedback mechanisms we take a general approach and make the assumption that for a user the most difficult step is to assess whether a multimedia item is relevant, whether the similarity between two items is correct, etc. Thus we consider the following general evaluation measure:

• Assessment count : the number of times a user had to assess some element in the query space.

Clearly, as in the case of precision and recall, the design of an iteration process will always be a balance between iteration and assessment count. Reducing one will introduce an increase in the other.

9.6 Discussion and conclusion

We consider the emphasis on interaction in image retrieval as one of the major changes with respect to the computer vision tradition, as was already noted in the 1992 workshop [58]. Interaction was first picked up by frontrunners, such as the NEC laboratories in Japan and the MIT MediaLab, to name a few. Now, interaction and feedback have moved into the focus of attention. Putting the user in control and visualization of the content has always been a leading principle in information retrieval research. It is expected that more and more techniques from traditional information retrieval will be employed, or reinvented, in content-based image retrieval. Text retrieval and image retrieval share the need for visualizing the information content in a meaningful way as well as the need to accept a semantic response of the user rather than just providing access to the raw data.

User interaction in image retrieval has, however, some characteristics different from text retrieval. There is no sensory gap, and the semantic gap from keywords to full text in text retrieval is of a different nature. No translation is needed from keywords to pictorial elements.

A critical point in the advancement of content-based retrieval is the semantic gap, where the meaning of an image is rarely self-evident.

Use of content-based retrieval for browsing will not be within the grasp of the general public as humans are accustomed to rely on the immediate semantic imprint the moment they see an image, and they expect a computer to do the same. The aim of content-based retrieval systems must be to provide maximum support in bridging the semantic gap between the simplicity of available visual features and the richness of the user semantics.

Any information the user can provide in the search process should be employed to provide the rich context required in establishing the meaning of a picture. The interaction should form an integral component in any modern image retrieval system, rather than a last resort when the automatic methods fail. Already at the start, interaction can play an important role. Most current systems perform query space initialization irrespective of whether a target search, a category search, or an associative search is requested. But the fact of the matter is that the set of appropriate features and the similarity function depend on the user goal. Asking the user for the required invariance yields a solution for a specific form of target search. For category search and associative search the user-driven initialization of query space is still an open issue.


For image retrieval we have identified six query classes, formed by the Cartesian product of the result type {exact, approximate} and the level of granularity of the descriptions {spatial content, image, image groups}. The queries based on spatial content require segmentation of the image. For large data sets such queries are only feasible when some form of weak segmentation can be applied to all images, or when signs are selected from a predefined legend. A balance has to be found between flexibility on the user side and scalability on the system side. Query by image example has been researched most thoroughly, but a single image is only suited when another image of the same object(s) is the aim of the search. In other cases there is simply not sufficient context. Queries based on groups as well as techniques for prior identification of groups in data sets are promising lines of research. Such group-based approaches have the potential to partially bridge the semantic gap while leaving room for efficient solutions.

Due to the semantic gap, visualization of the query space in image retrieval is of great importance for the user to navigate the complex query space. While currently 2- or 3-dimensional display spaces are mostly employed in query by association, target and category search are likely to follow. It is often overlooked that the query result has an inherent display dimension. Most methods simply display images in a 2D grid. Enhancing the visualization of the query result is, however, a valuable tool in helping the user navigate query space. As apparent from the query space framework, there is an abundance of information available for display. New visualization tools are called for to allow for user- and goal-dependent choices on what to display. In all cases, an influx of computer graphics and virtual reality is foreseen in the near future.

Through manipulation of the visualized result, the user gives feedback to the system. The interaction patterns as enumerated here reveal that in current methods feedback leads to an update of just one of the components of query space. There is no inherent reason why this should be the case. In fact, joint updates could indeed be effective and well worth researching. For example, the pattern which updates category membership based on a dynamic similarity function would combine the advantages of browsing with category and target search.

One final word about the impact of interactivity on the architecture of the system. The interacting user brings about many new challenges for the response time of the system. Content-based image retrieval is only scalable to large data sets when the database is able to anticipate what interactive queries will be made. A frequent assumption is that the image set, the features, and the similarity function are known in advance. In a truly interactive session, these assumptions are no longer valid. A change from static to dynamic indexing is required.

Interaction helps a great deal in limiting the semantic gap. Another important direction in limiting the gap is making use of more than one modality. Often a picture has a caption, is part of a website, etc. This context can be of great help. Although there are currently several methods for indexing which do take multiple modalities into account, there are no methods which integrate the modalities in interactive search.

Keyterms in this chapter

Answer search, target search, category search, associative search, object search, query space, exact query, approximate query, spatial predicate, image predicate, group predicate, spatial example, image example, group example, interactive query session, system feedback, relevance feedback, precision, recall, iteration count, assessment count, visualization operator, display space, perceived dimension, projection operator, screen space


Chapter 10

Data export

10.1 Introduction

When we aim to distribute items in the form of multimedia documents from the database, e.g. the result of some specific query, or if we want to put a multimedia document on the web, the format should be self-descriptive so that the client end can interpret it in the correct way. Proprietary formats require the use of the same tools at the client and the server side. The use of standards makes this far easier, as the client and server are free in choosing their tools as long as they conform to the standard. Of course, currently the most important standard used for exchange is XML. For use in a multimedia context several extensions have been proposed. Given the nature of multimedia documents we need formats for exchanging the following:

• multimedia document content : formats dealing with the different forms of data representations and content descriptors.

• multimedia document layout : formats dealing with the spatial and temporal layout of the document.

There are a number of different standards that consider the two elements above. We focus on MPEG-7 [80] for content description and on SMIL as a standard for multimedia document layout.

10.2 MPEG-7

MPEG-7 is an ISO standard which provides a rich set of tools for completely describing multimedia content. It does so at all the levels of description identified in chapter 3, i.e. at the conceptual, perceptual, and non audio-visual level; in addition it provides descriptions which are aimed more at the data management level:

• Classical archival-oriented description, i.e. non audio-visual descriptions.

– Information regarding the content's creation and production, e.g. director, title, actors, and location.

– Information related to using the content, e.g. broadcast schedules and copyright pointers.

– Information on storing and representing the content, e.g. storage format and encoding format.


• Perceptual descriptions of the information in the content

– Information regarding the content's spatial, temporal, or spatio-temporal structure, e.g. scene cuts, segmentation in regions, and region tracking.

– Information about low-level features in the content, e.g. colors, textures, sound timbres, and melody descriptions.

• Descriptions at the semantic level

– Semantic information related to the reality captured by the content, e.g. objects, events, and interactions between objects.

• Information for organizing, managing, and accessing the content.

– Information about how objects are related and gathered in collections.

– Information to support efficient browsing of content, e.g. summaries, variations, and transcoding information.

– Information about the interaction of the user with the content, e.g. user preferences and usage history.

As an example, below is the annotation of a two-shot video where the first shot is annotated with some information. Note that the description is bulky, but contains all information required.

<?xml version="1.0" encoding="iso-8859-1"?>
<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:mpeg7="urn:mpeg:mpeg7:schema:2001"
       xsi:schemaLocation="urn:mpeg:mpeg7:schema:2001 Mpeg7-2001.xsd">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition>
          <VideoSegment>
            <TextAnnotation type="scene" relevance="1" confidence="1">
              <FreeTextAnnotation>Outdoors</FreeTextAnnotation>
              <FreeTextAnnotation>Nature(High-level)</FreeTextAnnotation>
              <FreeTextAnnotation>Forest</FreeTextAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:00:0F30</MediaTimePoint>
              <MediaIncrDuration mediaTimeUnit="PT1N30F">160</MediaIncrDuration>
            </MediaTime>
            <TemporalDecomposition>
              <VideoSegment>
                <MediaTime>
                  <MediaTimePoint>T00:00:02:19F30</MediaTimePoint>
                </MediaTime>
              </VideoSegment>
            </TemporalDecomposition>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
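As a minimal sketch of reading such a description at the client end (assuming the XML above is saved as "description.xml"; the file name is illustrative), Python's standard xml.etree.ElementTree module suffices:

import xml.etree.ElementTree as ET

NS = "{urn:mpeg:mpeg7:schema:2001}"
root = ET.parse("description.xml").getroot()

annotations = [e.text for e in root.iter(NS + "FreeTextAnnotation")]
time_points = [e.text for e in root.iter(NS + "MediaTimePoint")]
print(annotations)   # ['Outdoors', 'Nature(High-level)', 'Forest']
print(time_points)   # ['T00:00:00:0F30', 'T00:00:02:19F30']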


It should be noted that MPEG-7 is not geared towards a specific application, but it aims to provide a generic framework to facilitate exchange and reuse of multimedia content across different application domains.

10.3 SMIL: Synchronized Multimedia Integration Language

To describe SMIL we will follow [19] and [20]. In these references SMIL is defined as:

SMIL is a collection of XML elements and attributes that we can use to describe the temporal and spatial coordination of one or more media objects.

It has been implemented in many media players, in particular in the RealPlayer and to some extent in Internet Explorer.

An important component of the format is the definition of timing and synchronization of different media components. To that end three different basic media containers are defined:

• seq, or sequential time container, where the different elements in the container are played in the order given.

• par, or parallel time container, indicating that the elements in the container can be played at the same time.

• excl, or exclusive container, indicating that only one of the elements of the container can be active at some given point in time.

To specify the timing of an individual media item or the container as a whole, the attributes begin, end, and dur (for duration) are used. The above containers can be nested to create a presentation hierarchy. To that end, it should be noted that the timing mentioned above can be dynamic. The composite timeline is computed by considering the whole hierarchy. For example, the duration of a par container equals the time of the longest playing media item. In many cases this will not be known until the document is actually played. Individual items are specified with relative starts and ends. For example, an item in a seq sequence can indicate that it will start 2 seconds after the preceding one. Finally, in an excl container the item to select is typically based on some event, like another item that starts playing or a button selected by the user. Examples of the three containers are given in figure 10.1.

In a perfect world any system using any connection should be able to adhere to the timings as given in the attributes. However, this cannot always be assured. SMIL deals with these problems in two different ways. First, it has three high-level attributes to control synchronization:

• syncBehavior : lets a presentation define whether there can be slippage in implementing the presentation's composite timeline.

• syncTolerance: defines how much slip is allowed.

• syncMaster : lets a particular element become the master timebase against which all others are measured.

The above deals with small problems occurring over the communication channel. However, SMIL also allows one to define the presentation in such a way that it can adapt to the particular device or communication channel used. In this manner a


<seq>
  <img id="a" dur="6s" begin="0s" src="..." />
  <img id="b" dur="4s" begin="0s" src="..." />
  <img id="c" dur="5s" begin="2s" src="..." />
</seq>

<par>
  <img id="a" dur="6s" begin="0s" src="..." />
  <img id="b" dur="4s" begin="0s" src="..." />
  <img id="c" dur="5s" begin="2s" src="..." />
</par>

<excl>
  <img id="a" dur="6s" src="..." begin="0s;button1.activateEvent" />
  <img id="b" dur="4s" src="..." begin="0s;button2.activateEvent" />
  <img id="c" dur="5s" src="..." begin="2s;button3.activateEvent" />
</excl>

Figure 10.1: Abstract examples with timelines and associated code of the seq, par, and excl container.

presentation can be developed that can be run both on a full-fledged PC, where for example a complete video presentation is shown, and on a mobile phone, where a slide show with thumbnail images is presented. The second solution provided by SMIL is in the form of the <switch> statement. In this way we can for example test on the systemBitrate (typically defined by the connection) and choose the appropriate media accordingly. So, for example, the following code fragment selects the video if the bitrate is 112 Kbit/s or above, a slide show if the bitrate is at least 56 Kbit/s, and finally, if the connection has an even lower bitrate, a text is shown.

<switch>
  <video src="videofile.avi" systemBitrate="115200" />
  <seq systemBitrate="57344">
    <img src="img1.png" dur="5s" />
    ...
    <img src="img12.png" dur="9s" />
  </seq>
  <text src="desc.html" dur="30s" />
</switch>

The layout in a SMIL document is specified by a hierarchical specification of regions on the page. As subregions are defined relative to the position of their parents, they can be moved as a whole. An example is given in figure 10.2.

Keyterms in this chapter

XML, SMIL, MPEG-7, multimedia document layout


<toplayout id="id1" width="800" height="600">
  <region id="id2" left="80" top="40" width="30" height="40">
    <region id="id3" left="40" top="60" width="20" height="10"> </region>
  </region>
  <region id="id4" left="420" top="80" width="30" height="40"> </region>
</toplayout>


Figure 10.2: Simple layout example in SMIL. The toplayout defines the size of the presentation as a whole. The other regions are defined as subregions of their parent.
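To connect media items to such a layout (a hedged sketch, not part of the original figure; the file name is hypothetical), a media element refers to a region by its id through the region attribute:

<body>
  <par>
    <!-- the image is rendered inside region id2 of the layout above -->
    <img src="logo.png" region="id2" dur="10s" />
  </par>
</body>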

