SCHEMA IST-2001-32795 05/06/2003

INFORMATION SOCIETY TECHNOLOGIES (IST)

PROGRAMME

Project Number: IST-2001-32795
Project Title: Network of Excellence in Content-Based Semantic Scene Analysis and Information Retrieval
Deliverable Type: PU

Deliverable Number: D2.1
Contractual Date of Delivery: 30.09.2002 (month 5 of the project)
Actual Date of Delivery: 20.09.2002
Title of Deliverable: State of the art in content-based analysis, indexing and retrieval
Work-Package contributing to the Deliverable:
Nature of the Deliverable: RE
Work-Package Leader: Ebroul Izquierdo, Queen Mary University of London
Authors (Alphabetical order): Michel Barlaud (University of Nice-Sophia Antipolis), Ebroul Izquierdo (Queen Mary University of London), Riccardo Leonardi (University of Brescia, Italy), Vasileios Mezaris (Informatics and Telematics Institute), Pierangelo Migliorati (University of Brescia, Italy), Evangelia Triantafyllou (Informatics and Telematics Institute), Li-Qun Xu (BTExact Technologies).

Abstract: The amount of audiovisual information available in digital format has grown exponentially in recent years. Gigabytes of new images, audio and video clips are generated and stored every day. This has led to a huge, distributed and mostly unstructured repository of multimedia information. In order to realize the full potential of these databases, tools for automated indexing and intelligent search engines are urgently needed. Indeed, image and video cataloguing, indexing and retrieval are the subject of active research in industry and academia across the world. This reflects the commercial importance of such technology. The aim of this report is to give the reader a review of the current state of content-based retrieval systems. To begin with, the need for such systems and their potential applications are introduced. The second section deals with techniques for temporal segmentation of raw video, while in section three similar methods for compressed video are described. Video segmentation using shape modelling is then described in section four. Low-level descriptors for image indexing and retrieval are reported in section five, whereas techniques for the automatic generation of semantic (high-level) descriptors are described in section six. Metrics to measure similarity between descriptors in the metadata space are reported in section seven. A survey of the literature on audio content analysis for multimedia indexing and retrieval is given in section eight, and content characterization of sports programs is described in section nine. The report closes by describing the most relevant commercial and non-commercial multimedia portals in sections ten and eleven.

Keyword List: Video Indexing, Retrieval, Multimedia Portals, Multimedia Systems
*Type: PU - public


Table of Contents

1. Motivations, Applications and Needs
2. Temporal Segmentation Using Uncompressed Video
3. Temporal Segmentation and Indexing in the Compressed Domain
4. Video Segmentation Using Shape Modeling
5. Image Indexing in the Spatial Domain
6. High-Level Descriptors
7. Defining Metrics between Descriptors and Relevance Feedback
8. Audio-based and Audio-assisted Semantic Content Analysis
9. Content Characterization of Sports Programs
10. Content-Based Indexing and Retrieval Systems
11. Other Commercial Content-Based Image Retrieval Systems
12. References


1. Motivations, Applications and Needs

The rapid development of innovative tools to create user-friendly and effective multimedia libraries, services and environments requires novel concepts to support the storage of huge amounts of digital visual data and fast retrieval. Currently, whole digital libraries of films, video sequences and images are being created, guaranteeing an everlasting quality to the documents stored. As a result of almost daily improvements in encoding and transmission schemes, the items of these databases are easily accessible by anyone on the planet. In order to realize the full potential of these technologies, tools for automated indexing and intelligent search engines are urgently needed. Indeed, image and video cataloguing, indexing and retrieval are the subject of active research in industry and academia across the world. This reflects the commercial importance of such technology and evidences the fact that many problems are left unsolved by currently implemented systems.

In conventional systems, visual items are manually annotated with textual descriptions of their content. For instance, if an image can be manually labeled as "city centre", then the problem may appear to have been finessed. However, the adequacy of such a solution depends on human interaction, which is expensive and time consuming and therefore infeasible for many applications. Furthermore, such semantic-based search is completely subjective and depends on semantic accuracy in describing the image. While one human operator could label an image as "city centre", a second would prefer the term "traffic jam", a third would think of "streets and buildings", and so on. Indeed, the richness of the content of an image is difficult to describe with a few keywords, and the perception of an image is a subjective and task-dependent process. Trying to foresee which elements of the images will be the most useful for later retrievals is often very difficult. The problem is exacerbated in the case of video, where motion and temporality come into play.

Much work on image indexing and retrieval has focused on the definition of suitable descriptors and the generation of metrics in the descriptor space. Although system efficiency in terms of speed and computational complexity has also been the subject of research, many related problems remain unsolved. The major problem to be faced when efficient schemes for image indexing and retrieval are envisaged is the large workload and high complexity of the underlying image-processing algorithms. Basically, fast indexing, cataloguing and retrieval are fundamental requirements of any user-friendly and effective retrieval scheme. The search for a fast, efficient and accurate method based on inherent image primitives is a very important and open problem in advanced multimedia systems.

According to the features used, techniques for video indexing and retrieval can be grouped into two types: low-level and semantic. Low-level visual features refer to primitives such as colour, shape, texture, etc. Semantic content comprises high-level concepts such as objects and events, and can be conveyed through many different visual presentations. The main distinction between these two types of content lies in the different requirements for their extraction.

Important applications of content-based image and video retrieval technology include the medical domain, where powerful visualization methods have been developed in the last few years, from X-rays to MRI. As a result, a vast quantity of medical images is generated each year. These images need to be analysed and archived for later use. Satellites screen our planet and send us hundreds of images every day for a wide range of purposes, from military to ecological. For the analysts on the ground, it is important to have tools to organise and browse these images at multiple resolutions. Some work has been done to satisfy these needs [A49, A92, A85]. Art galleries and museums store their collections digitally for inventory purposes, as well as making them available on CD-ROMs or on the Internet. The need for suitable indexing and retrieval techniques has already been addressed in [A27, A9]. In the broadcasting industry, journalists need systems for retrieving and quickly browsing archived sequences referring to a particular public figure or a particular event. A final application is interactive television: as stated in [A11], viewers will need services that allow them to search and download all types of television shows from distant sources.


2. Temporal Segmentation Using Uncompressed Video

Cognitively, the predominant feature in video is its higher-level temporal structure. People are unable to perceive millions of individual frames, but they can perceive episodes, scenes, and moving objects. A scene in a video is a sequence of frames that are considered to be semantically consistent. Scene changes therefore demarcate changes in semantic context. Segmenting a video into its constituent scenes permits it to be accessed in terms of meaningful units.

A video is physically formed by shots and semantically described by scenes. A shot is a sequence of frames representing continuous action in time and space. A scene is a story unit and consists of a sequence of connected or unconnected shots. Most of the current research efforts are devoted to shot-based video segmentation. Algorithms for scene change detection can be classified, according to the features used for processing, into uncompressed and compressed domain algorithms. Temporal segmentation is the process of decomposing video streams into these syntactic elements. Shots are sequences of frames recorded continuously by one camera, and scenes are composed of a small number of interrelated shots that are unified by a given event [A8].

Differences between frames can be quantified by pairwise pixel comparisons, or with schemes based on intensity or colour histograms. Motion and dynamic scene analysis can also provide cues for temporal segmentation. A good review of these scene detection schemes is found in [A2]. Another approach is proposed by Corridoni and Del Bimbo to detect gradual transitions [A19]. They introduce a metric based on the chromatic properties. Ardizzone et al. [A3] proposed a neural network approach for scene detection in the video retrieval system JACOB [A13]. The approach reported in [A91, A106] uses a priori knowledge to identify scenes in a video sequence. In [A19] Corridoni and Del Bimbo focus on scene detection under a restricted condition: the shot/reverse shot scenes defined in [A58]. They exploit the periodicity in the composing shots induced by this shooting technique. In [A105], shots are grouped into clusters after a proximity matrix has been built. By adding temporal constraints during the clustering process, Yeung et al. make clusters of similar shots correspond to actual scenes or story units [A103]. In [A104], they extend their work to the automatic characterisation of whole video sequences.
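As a simple illustration of the histogram-based frame differencing mentioned above, the following Python sketch flags a cut whenever the L1 distance between colour histograms of consecutive frames exceeds a threshold; the function names and the threshold value are illustrative and not taken from any of the cited systems.

import numpy as np

def colour_histogram(frame, bins=16):
    # frame: H x W x 3 uint8 array; concatenated per-channel histogram, normalised
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(hist).astype(float)
    return h / h.sum()

def detect_cuts(frames, threshold=0.35):
    # Declare a cut where the L1 distance between consecutive frame histograms
    # exceeds the (illustrative) threshold.
    cuts = []
    prev = colour_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = colour_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts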


3. Temporal Segmentation and Indexing in the Compressed Domain

Avoiding decompression of compressed visual items before their insertion in the database and their indexing has advantages in terms of storage and computational time. This is particularly important in the case of video sequences: a typical movie, when compressed, occupies almost 100 times less memory than when decompressed [A61]. Current research attempts to perform video parsing and low-level feature extraction on images and video sequences compressed with the JPEG and MPEG standards [A20, A92, A16, A21, A52, A80, A4, A76, A105]. Other compression schemes have also been considered, in particular subband or wavelet-based schemes [A41, A43] and Vector Quantization schemes [A30].

In [A61] a 'content access work' measure to evaluate the performance of the next generation of coding schemes is proposed. Zhang et al. [A107] use motion to identify key-objects, and a framework for both video indexing and compression is proposed. Deng et al. [A21] have proposed an object-based video representation tailored to the MPEG-4 standard. Irani et al. [A33] have proposed a mosaic-based video compression scheme which could be combined with their mosaic-based video indexing and retrieval scheme. In [A4], scene detection is performed on JPEG coded sequences. Zhang et al. [A31] use a normalised L1 norm to compare corresponding blocks of coefficients of successive DCT-coded images. This method requires less processing than the one reported in [A4] but, according to [A31], it is more sensitive to gradual changes. In [A20], abrupt cuts are detected at motion discontinuities between two consecutive frames, from the macroblock information contained in the P and B frames. Chang et al. [A14] report an approach based on motion vectors for the VideoQ system.

In a more comprehensive study, Calic and Izquierdo [B25, B26, B27, B28] present an approach to the problem of key-frame extraction and video parsing in the compressed domain. The algorithms for temporal segmentation and key-frame extraction are unified in one robust algorithm with real-time capabilities. A general difference metric is generated from features extracted from MPEG streams, and a specific discrete curve evolution algorithm is applied to simplify the metric curve. They use the notion of a dominant reference frame, i.e. the reference frame (I or P) used as the prediction reference for most of the macroblocks of a subsequent B frame. The proposed algorithm shows high accuracy and robust performance, running in real time with good customisation possibilities.

3.1 Cut Detection

As stated previously, there are several camera cut detection algorithms that work in the spatial domain [B1]. They can be classified as pixel-based, statistic-based and histogram-based. Patel & Sethi [B2] exploit the possibility of using these algorithms directly in the compressed domain. Yeo & Liu [B3] have described a way to estimate the DC sequence from P-frames and B-frames. Deardorff et al. [B4] study the file size dynamics of Motion-JPEG to detect cuts. Deng & Manjunath [B5] investigate the motion dynamics of P-frames and B-frames. Zabih et al. [B6] present an approach based on edge features. Shen et al. [B7] propose a method that applies a Hausdorff distance histogram and a multi-pass merging algorithm to replace motion estimation.
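For illustration, a minimal sketch of cut detection directly on DC images is given below, assuming the DC sequence has already been reconstructed from the bitstream (e.g. along the lines of [B3]); the peak threshold and interface are ours, not those of the cited works.

import numpy as np

def dc_image_cuts(dc_images, threshold=20.0):
    # dc_images: list of small greyscale arrays (one DC coefficient per 8x8 block),
    # obtained from I-frames and estimated for P/B-frames.
    # A cut is declared where the mean absolute DC difference exceeds the threshold.
    cuts = []
    for i in range(1, len(dc_images)):
        d = np.abs(dc_images[i].astype(float) - dc_images[i - 1].astype(float)).mean()
        if d > threshold:
            cuts.append(i)
    return cuts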

3.2 Scene Detection and Camera Parameters

Scenes can only be marked by semantic boundaries. Yeung et al. [B8] propose a time-constrained clustering algorithm to group similar shots into scenes. Rui et al. [B9] suggest a similar approach, namely time-adaptive clustering.

In the context of the detection of camera parameters in the compressed domain, Zhang et al. [B1] detect zoom by manipulating the motion vectors of the upper and lower rows or left and right columns. Meng & Chang [B10] combine histograms, discrete searches, and a least-squares method to estimate camera parameters and object motion trajectories.

In [A92], Tao and Dickinson present a hierarchical template-based algorithm to retrieve satellite images which contain a template of arbitrary size, specified during a query. In [A105], shots are clustered by assessing the similarity of the DC images from representative MPEG I frames. Dimitrova and Abdel-Mottaleb [A22] proposed a method in which video sequences are temporally segmented in the compressed domain and representative frames are selected for each shot. In [A52], Meng and Chang report methods to estimate the camera motion parameters as well as to detect moving objects, using the motion vectors and global motion compensation. The methods reported in [A84] and [A76] also detect camera operations such as zooms and pans from MPEG coded images. In the indexing scheme developed by Iyengar and Lippman [A34], segments of 16 frames of 64x64 pixels were indexed by 8-dimensional feature vectors characterizing motion, texture and colour properties of the segments. In [A66], the variance of the first 8 AC coefficients within 8x8 blocks is proposed as a texture feature. In the Photobook system [A60], the Karhunen-Loève decomposition was implemented in the pixel domain. In [A80], it is shown how the Euclidean distance between two vectors in an eigenspace is a measure of the correlation between the two corresponding vectors in their original space.
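To make the motion-vector-based camera analysis more concrete, the following sketch fits a simple 4-parameter global motion model to macroblock motion vectors by least squares; the model and names are illustrative and not the exact formulation used in [A52] or [B10].

import numpy as np

def fit_global_motion(positions, vectors):
    # positions: (N, 2) block centres (x, y); vectors: (N, 2) motion vectors (u, v).
    # Fit the similarity model  u = s*x - r*y + tx,  v = r*x + s*y + ty  by least squares;
    # a clearly non-zero s indicates zoom (divergence), r rotation, and (tx, ty) pan.
    x, y = positions[:, 0], positions[:, 1]
    u, v = vectors[:, 0], vectors[:, 1]
    A = np.zeros((2 * len(x), 4))
    A[0::2] = np.column_stack([x, -y, np.ones_like(x), np.zeros_like(x)])
    A[1::2] = np.column_stack([y, x, np.zeros_like(x), np.ones_like(x)])
    b = np.empty(2 * len(x)); b[0::2] = u; b[1::2] = v
    (s, r, tx, ty), *_ = np.linalg.lstsq(A, b, rcond=None)
    return s, r, tx, ty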

Clearly, compressed images are a rich source of information and a whole range of analyses are possible in the compressed domain. Saur et al. [A76] have shown that most of the analysis performed in the spatial domain can be envisaged in the compressed domain. In [A41], a temporal segmentation scheme is carried out on subband encoded videos. Liang et al. have presented a multiresolution indexing and retrieval approach for images compressed with a wavelet-based scheme [A43]. In [A30], a Vector Quantization coding scheme is proposed and shot detection as well as video retrieval are shown to be possible.

3.3 Key Frame Extraction

To reduce the complexity of the video indexing and retrieval problem, key frames are used to represent each shot. For this, Han & Tewfik [B11] perform principal component analysis on video sequences and derive two discriminants from the first few retained principal components. Xiong et al. [B12] propose a more compact way of selecting key frames. They search for key frames sequentially and then extend the representative range of each key frame as far as possible. Gresle & Huang [B13] suggest selecting as the key frame the frame with the minimum temporal difference between two local maxima of the desired distance.
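A minimal sketch of sequential key-frame selection within a shot, loosely in the spirit of the approach of [B12], is given below; the feature choice and threshold are illustrative.

import numpy as np

def extract_key_frames(features, threshold=0.3):
    # features: list of per-frame feature vectors (e.g. colour histograms) for one shot.
    # A frame becomes a new key frame when it differs too much from the current key
    # frame, i.e. when it falls outside that key frame's representative range.
    key_frames = [0]
    for i in range(1, len(features)):
        if np.abs(features[i] - features[key_frames[-1]]).sum() > threshold:
            key_frames.append(i)
    return key_frames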

3.4 Extraction of Semantic Descriptors

To extract semantic features from compressed video, some studies have been conducted on motion picture grammars [B14], video summaries [B15], and standard descriptions of multimedia objects [B16]. Yeo & Yeung [B17] present an approach to construct a scene transition graph (STG) based on visual similarity and temporal relationships among shots. Yeung & Yeo [B15] also describe a similar heuristic approach to produce video posters automatically. Smith & Kanade [B18] propose a method that incorporates textual indexing techniques.

3.5 Key Frame Indexing

Given the large scale of image databases (e.g., WebSeek [B19]; ImageRover [B20] contains more than 650,000 images), direct feature extraction in the compressed domain is preferable for fast indexing and retrieval. Recently, Ngo et al. [B21] presented an approach to extract shape, texture and color features directly in the DCT domain of JPEG. The focus of image indexing has also shifted from finding the optimal features to constructing interactive mechanisms capable of modeling the subjectivity of human perception. In this context, Rui et al. [B22] investigate relevance feedback in order to determine the appropriate features and similarity measures for retrieval. The most relevant image primitives used for indexing and retrieval are colour, texture and shape.

Usually, colour information is extracted from DC values and used to compute histogram features. In JPEG images, AC coefficients are used for texture retrieval. Hsu et al. [B23] extract 48 statistical features to classify man-made and natural images. Wavelet packet analysis is also widely applied to index textures [B24]. This approach supports hierarchical search of images with filtering capability. Ngo et al. [B21] suggest a shape indexing technique in the DCT domain. This approach generates the image gradient from the first two AC coefficients, tracks the contour of the underlying object, and then computes invariant contour moments for indexing. The computed features are invariant to scaling and translation.
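As an illustration of DC-based colour indexing, the sketch below builds a normalised histogram descriptor from the DC coefficient planes of a JPEG/MPEG image; it assumes the DC planes have already been parsed from the bitstream and is not the descriptor of any specific cited system.

import numpy as np

def dc_colour_histogram(dc_y, dc_cb, dc_cr, bins=8):
    # dc_y, dc_cb, dc_cr: 2-D arrays of DC coefficients (one per 8x8 block) for the
    # luminance and the two chrominance channels.
    hists = []
    for dc in (dc_y, dc_cb, dc_cr):
        h, _ = np.histogram(dc.ravel(), bins=bins)
        hists.append(h)
    h = np.concatenate(hists).astype(float)
    return h / h.sum()   # normalised descriptor suitable for histogram comparison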


4. Video Segmentation Using Shape Modeling

4.1 Active Contour Segmentation

The purpose of segmentation is to isolate an object (or several objects) of interest in an image or a sequence. Given an initial contour (a closed curve), the active contour technique consists in applying locally a force (or displacement, or velocity) such that the initial contour evolves toward the contour of the object of interest. This force is derived from a characterization of the object formally written as a criterion to be optimized.

4.1.1 Boundary-Based Active Contours

In boundary-based active contour techniques, the object is characterized by properties of its contour only. The original active contour developments were called snakes [F1]. Only the convex hull of objects could be segmented because these techniques were based on a minimum length penalty. In order to be able to segment concave objects, a balloon force was heuristically introduced. It was later theoretically justified as a minimum area constraint [F2] balancing the minimum length constraint. The geodesic active contour technique [F3] is the most general form of boundary-based techniques. The contour, minimizing the energy, can be interpreted as the curve of minimum length in the metric defined by a positive function ``describing'' the object of interest. If this function is the constant function equal to one, the active contour evolution equation is called the geometric heat equation by analogy with the heat diffusion equation. The ``describing'' function can also be a function of the gradient of the image. In this case, the object contour is simply characterized by a curve following high gradients. As a consequence, the technique is effective only if the contrast between the object and the background is high. Moreover, high gradients in an image may correspond to the boundaries of objects that are not of interest. Regardless of the ``describing'' function, information on the boundary is too local for segmentation of complex scenes. A global, more sophisticated object characterization is needed.
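For reference, the geodesic active contour of [F3] is commonly written in the following form (the notation below is chosen here for illustration and is not copied from the cited work):

E(C) = \int_0^{L(C)} g\big(|\nabla I(C(s))|\big)\, ds, \qquad \text{e.g. } g(r) = \frac{1}{1 + r^2},

\frac{\partial C}{\partial t} = g\big(|\nabla I|\big)\,\kappa\,\mathbf{N} - \big(\nabla g \cdot \mathbf{N}\big)\,\mathbf{N},

where \kappa is the curvature and \mathbf{N} the inward unit normal. For g \equiv 1 the evolution reduces to the geometric heat equation \partial C / \partial t = \kappa\,\mathbf{N} mentioned above.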

4.1.2 Region-Based Active Contours

In order to better characterize an object (and to be less sensitive to noise), region-based active contour techniques were proposed [F4, F5]. A region is represented by parameters called ``descriptors''. Two kinds of region are usually considered: the object of interest and the background. Note that region-based/boundary-based hybrid techniques are common [F6, F7, F8, F9]. In the general case, descriptors may depend on their respective regions, for instance, statistical features such as the mean intensity or variance within the region [F10]. The general form of a criterion includes both region-based and boundary-based terms. Classically, the integrals over the domains are reduced to integrals along the contour using the Green-Riemann theorem [F6, F11, F12, F8, F13, F14] or continuous media mechanics techniques [F15, F40, F16]. Two active contour approaches are possible to minimize the resulting criterion:

(i) It is possible to determine the evolution of an active contour (from one iteration to the next) without computing a velocity: the displacement of a point of the contour is chosen among small random displacements as the one leading to the (locally) optimal criterion value [F6, F11]. However, this implies computing the criterion value several times for each point;

(ii) Alternatively, differentiating the criterion with respect to the evolution parameter allows one to find an expression of the appropriate displacement (or velocity) for each point [F12, F8, F13, F14]. In this case the region-dependency of the descriptors must be taken into account in the derivation of the velocity. It has been shown that it induces additional terms leading to greater segmentation accuracy [F38, F39]. The development is general enough to establish a framework for region-based active contours. It is inspired by shape optimization techniques [F17, F18]. If the region-dependency of the descriptors is not considered [F15, F40, F16, F12, F8, F13, F14], some corrective terms in the expression of the velocity may be omitted.

4.2 Active Contour Implementation

4.2.1 From Parametric To Implicit

The first implementations of the active contour technique were based on a parametric (or explicit) description of the contour [F1] (Lagrangian approach). However, management of the evolution, particularly topology changes and sampling density along the contour, is not simple [F24]. Instead, an Eulerian approach known as the level set technique [F25, F26] can be used. In two dimensions, the contour is implicitly represented as the intersection of a surface with the plane of elevation zero. The contour can also be seen as the isocontour of level zero on the surface. In three dimensions, the contour is the isosurface of level zero in a volume. In n dimensions, the contour is the hyperplane of level zero, with the space filled in with the values of a real, continuous function. Note that the contour can actually be composed of several closed contours without intersections with each other. By a continuous change of the elevation function, a contour can appear or disappear without explicit handling. Unfortunately, the level set technique has a high computational cost, and the extension of the velocity to levels other than the zero level is not straightforward [F27] (although it is theoretically necessary). Moreover, a curvature term (minimum length penalty) is usually added to the velocity expression in order to decrease the influence of image noise on the evolution. However, the curvature being a second derivative term, its numerical approximation is usually not accurate.
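A minimal sketch of the implicit representation is given below: the contour is stored as the zero level set of an elevation (here, signed distance) function and recovered as its zero isocontour. The grid size and circle are arbitrary, and scikit-image is assumed to be available.

import numpy as np
from skimage import measure

# Signed elevation function whose zero level set is a circle of radius 30
# centred in a 128x128 grid; the contour is recovered as the zero isocontour.
yy, xx = np.mgrid[0:128, 0:128]
phi = np.sqrt((xx - 64) ** 2 + (yy - 64) ** 2) - 30.0

contours = measure.find_contours(phi, 0.0)   # list of (N, 2) arrays of contour points
print(len(contours), contours[0].shape)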

4.2.2 Splines: Back To The Parametric Approach

A cubic B-spline has several interesting properties: it is a C2 curve [F28]; it is an interpolation curve that minimizes the square of the second derivative of the contour (which is close to the squared curvature [F29]) under the constraint that the contour passes through the sampling points; and it has an analytical equation (defined by control points) between each pair of consecutive sampling points. The velocity has to be computed at the sampling points only. If the sampling is regular, the normal (and the curvature, if needed) can be computed using an exact, fast (recursive filtering) algorithm applied to the control points. Therefore, the spline implementation is much less time consuming than the level set technique [F30, F31, F41]. Moreover, the minimization of a curvature-type term helps in decreasing the influence of noise without the need to add a curvature term to the velocity. Nevertheless, noise in the image still implies noise in the velocity which, if the sampling is fine, usually leads to an irregular contour because, despite the smooth curvature property, a cubic B-spline is an interpolation curve. A smoothing spline approach can deal with this problem by providing an approximation curve controlled by a parameter balancing the trade-off between interpolation error and smoothness [F32, F33]. As with cubic B-splines, the normal and curvature can be computed exactly and efficiently.
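The interpolation-versus-smoothness trade-off can be illustrated with SciPy's periodic spline fitting; this is only an illustrative sketch, not the recursive-filtering implementation referred to above, and the smoothing parameter s is arbitrary.

import numpy as np
from scipy.interpolate import splprep, splev

# Noisy samples of a closed contour (a circle); the smoothing parameter s balances
# interpolation error against smoothness, as discussed above.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
x = np.cos(t) + 0.05 * np.random.randn(t.size)
y = np.sin(t) + 0.05 * np.random.randn(t.size)

tck, _ = splprep([x, y], s=0.5, per=True)               # periodic cubic B-spline approximation
xs, ys = splev(np.linspace(0, 1, 400), tck)             # resampled smooth contour
dxs, dys = splev(np.linspace(0, 1, 400), tck, der=1)    # derivatives, usable for normals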

4.3 Examples Of Applications


4.3.1 Image Segmentation

For this application, the test sequence shows a man making a call on a mobile phone. Two descriptors are used for the segmentation: a variance descriptor for the face (statistical descriptor) and a shape-of-reference descriptor for the constraint (geometrical descriptor). The combination of these descriptors implies a competition between the shape prior and the statistical information of the object to be segmented. If the shape-of-reference constraint is omitted, the face segmentation includes part of the hand of the character and does not include the lips. A shape of reference is heuristically defined, allowing the face to be segmented accurately.

4.3.2 Sequence Segmentation

The well-known ``Akiyo'' sequence is used for this application. The descriptor of the domain inside the contour in the segmentation criterion is a parameter acting as a constant penalty. The descriptor of the domain outside the contour is defined as the difference between the current image and a robust estimate of the background image computed from several previous images [F34]. This descriptor takes advantage of the temporal information of the sequence: it is a motion detector. In the case of a moving background (due to camera motion), a mosaicing technique can be used to estimate the background image [F35].
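A minimal sketch of such a motion-detector descriptor is given below, using a per-pixel temporal median as a simple stand-in for the robust background estimate of [F34]; the threshold is illustrative.

import numpy as np

def motion_descriptor(previous_frames, current, threshold=25):
    # previous_frames: list of greyscale frames; the per-pixel median over time is a
    # simple robust estimate of the static background.
    background = np.median(np.stack(previous_frames).astype(float), axis=0)
    difference = np.abs(current.astype(float) - background)
    return difference > threshold   # mask of pixels likely to belong to moving objects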

4.3.3 Tracking

The standard sequence ``Erik'' is used for this application. Given a segmentation of Erik's face in the first image, the purpose is to track the face throughout the sequence using the segmentation of the previous image to constrain the segmentation of the current frame [F21, F37]. Two descriptors are used: A variance descriptor for the face (statistical descriptor) and a shape of reference descriptor for the constraint (geometrical descriptor). The shape of reference in the current image is defined as an affine transform (translation, rotation, and scaling) of the segmentation contour in the previous image. The affine transform can be interpreted as the global motion combined with a global deformation (scaling) of the object of interest between the previous and the current image (other choices can be made for separation of the overall motion from the deformation [F36]). It is computed by a block matching method (ZNSSD criterion) applied to the points of the segmentation contour in the previous image in order to find their corresponding points in the current image. The resulting contour is used both as the shape of reference and the initial contour of the active contour process for segmentation of the current image.
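To make the matching step concrete, the sketch below implements exhaustive block matching with a ZNSSD score for a single contour point; the window sizes are illustrative and this is only a sketch of the general technique, not the cited implementation.

import numpy as np

def znssd(patch_a, patch_b):
    # Zero-mean normalised sum of squared differences between two equal-sized patches.
    a = patch_a.astype(float) - patch_a.mean()
    b = patch_b.astype(float) - patch_b.mean()
    a /= (np.linalg.norm(a) + 1e-12)
    b /= (np.linalg.norm(b) + 1e-12)
    return np.sum((a - b) ** 2)

def match_point(prev_img, cur_img, point, half=8, search=12):
    # Find the displacement of a contour point between two greyscale frames by
    # exhaustive ZNSSD block matching in a (2*search+1)^2 window.
    y, x = point
    ref = prev_img[y - half:y + half + 1, x - half:x + half + 1]
    best, best_d = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = cur_img[y + dy - half:y + dy + half + 1, x + dx - half:x + dx + half + 1]
            if cand.shape != ref.shape:
                continue   # skip candidates falling outside the image
            d = znssd(ref, cand)
            if d < best_d:
                best, best_d = (dy, dx), d
    return best   # (dy, dx) displacement of the point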


5. Image Indexing in the Spatial Domain

In [B29], image annotation or indexing is defined as the process of extracting from the video data the temporal location of a feature and its value. As explained previously, indexing images is essential for providing content-based access. Indexing has typically been viewed either from a manual annotation perspective or from an image sequence processing perspective. The indexing effort is directly proportional to the granularity of video access. Existing work on content-based video access and image indexing can be grouped into three main categories: high-level indexing, low-level indexing and domain-specific indexing.

The work by Davis [B33, B34, B35] is an example of high-level indexing. This approach uses a set of predefined index terms for annotating video. The index terms are organized based on high-level ontological categories such as action, time and space. The high-level indexing techniques are primarily designed from the perspective of manual indexing or annotation. This approach is suitable for dealing with small quantities of new video and for accessing previously annotated databases. Low-level indexing techniques provide access to video based on properties like color, texture, etc. These are the most disseminated techniques in the literature. Domain-specific techniques use the high-level structure of video to constrain the low-level video feature extraction and processing. These techniques are effective only in a specific application domain and for that reason they have a narrow range of applicability. One of the pioneering works in this area is by Swanberg et al. [B31, B32]. They have presented work on finite state data models for content-based parsing and retrieval of news video. Smoliar et al. [B30] have also proposed a method for parsing news video. Underpinning all indexing techniques in the spatial domain are different processing tasks and methodologies, ranging from database management to low-level image understanding. Reviews on video database management can be found in [B36, B37, B38, B39]. A more detailed description of the most relevant techniques for database management in the context of indexing and retrieval will be given later in this report. Regarding low-level processing techniques, including segmentation, visual primitives and similarity metrics for image descriptors, the most relevant works from the literature are reviewed in the next subsections.

5.1 Segmentation

Another important aspect of content-based indexing is the need for spatial segmentation. Advanced indexing and retrieval systems aim to use and present video data in a highly flexible way, resembling the semantic objects humans are used to dealing with. Image segmentation is one of the most challenging tasks in image processing. In [B40], an advanced segmentation toolbox is described. In that work, Izquierdo and Ghanbari present a number of important techniques that can be employed to carry out the segmentation task. The goal is to develop a system capable of solving the segmentation problem in most situations encountered in video sequences taken from real-world scenes. To this end, the presented segmentation toolbox comprises techniques with different levels of trade-off between complexity and degrees of freedom. Each of these techniques has been implemented as an independent module. Four different schemes containing key components tailored for diverse applications constitute the core of the system. The first scheme consists of very low-complexity techniques for image segmentation addressing real-time applications under specific assumptions, e.g., head-and-shoulder video images from usual videoconferencing situations, and background/foreground separation in images with almost uniform background. The techniques implemented in this scheme are basically derived from simple interest operators for the recognition of uniform image areas, and from thresholding approaches [B45], [B63]. The methods are based on the assumption that foreground and background can be distinguished by their gray level values, or that the background is almost uniform. Although this first scheme seems simplistic, its usefulness is twofold: firstly, it is very important and fundamental in real-time applications in which only techniques with a very low degree of complexity can be implemented; and secondly, the complexity of the other implemented techniques can be strongly reduced if uniform image areas are first detected. The second scheme is concerned with multiscale image simplification by anisotropic diffusion and subsequent segmentation of the resulting smoothed images. The mathematical model supporting the implemented algorithms is based on the numerical solution of a system of nonlinear partial differential equations introduced by Perona and Malik [B62] and later extended by several other authors [B41], [B46], [B47], [B49]. The idea at the heart of this approach is to smooth the image in directions parallel to the object boundaries, inhibiting diffusion across the edges. The goal of this processing step is to enhance edges while keeping their correct positions, reducing noise and smoothing regions with small intensity variations. Theoretically, the solution of this nonlinear partial differential equation with the original image as initial value tends to a piecewise constant surface as the time (scale) tends to infinity. To speed up the convergence of the diffusion process, a quantization technique is applied after each smoothing step. The termination time of the diffusion process and the quantization degree determine the level of detail expressed in the segmented image.
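For illustration, a minimal explicit Perona-Malik scheme is sketched below; the intermediate quantization step applied by the toolbox of [B40] is omitted, and the conductance function and parameter values are only one common choice.

import numpy as np

def perona_malik(image, iterations=20, kappa=20.0, step=0.2):
    # Minimal Perona-Malik diffusion [B62]: intensity diffuses within smooth regions,
    # while the conductance g = exp(-(|grad|/kappa)^2) inhibits diffusion across strong edges.
    u = image.astype(float).copy()
    for _ in range(iterations):
        dn = np.roll(u, -1, axis=0) - u   # finite differences towards the four neighbours
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        u += step * sum(np.exp(-(d / kappa) ** 2) * d for d in (dn, ds, de, dw))
    return u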

Object segmentation consists of the extraction of the shapes of physical objects projected onto the image plane, ignoring edges due to texture inside the object borders. This extremely difficult image processing task differs, in the objective of the task itself, from the basic segmentation problem usually formulated as the separation of image areas containing pixels with similar intensity. While the result of segmentation can be a large number of irregular segments (based only on intensity similarity), object segmentation tries to recognize the shapes of complete physical objects present in the scene. This is the task addressed in the third and fourth schemes of the segmentation toolbox presented in [B40]. In most cases this more general segmentation cannot be carried out without additional information about the structure or dynamics of the scene. In this context, most approaches for object segmentation can be grouped into two broad classes. The first concerns methods for the extraction of object masks by means of multiview image analysis on sequences taken from different perspectives, e.g., stereoscopic images, exploiting the 3D structure of the scene [B42], [B51], [B52], [B55]. The second is motion-based segmentation when only monoscopic sequences are available [B44], [B53], [B59], [B64]. In the latter case the dynamics of the objects present in the scene are exploited in order to group pixels that undergo the same or similar motion. Because most natural scenes consist of locally rigid objects and objects deforming continuously in time, it is expected that connected image regions with similar motion belong to a single object.

Motion Driven Segmentation

In recent years, great efforts have been made to develop disparity- or motion-driven methods for object segmentation. Among others, Francois and Chupeau [B51] present a paper in which a depth-based segmentation algorithm is introduced. In contrast to the segmentation methods of [B40], they use a Markovian statistical approach to segment a depth map obtained from a previously estimated dense disparity field and camera parameters. Ibenthal et al. [B53] describe a method in which, unlike the contour-matching approach realized in [B40], a hierarchical segmentation scheme is applied. The motion field is used in order to improve the temporal stability and accuracy of the segmentation. Chang et al. [B48] introduced a Bayesian framework for simultaneous motion estimation and segmentation based on a representation of the motion field as the sum of a parametric field and a residual field. Borshukov et al. [B44] present a multiscale affine motion segmentation based on block affine modeling. Although in all these works the dynamics of the objects present in the scene are used to enhance the segmentation results, the extraction of accurate object masks is not addressed, since less attention is paid to the spatial reconstruction of object contours as a basis for object mask determination. In this context, Izquierdo and Kruse [B55] describe a method for accurate object segmentation tailored for stereoscopic sequences using disparity information and morphological transformations.

Still Image Segmentation

The overall segmentation process of a 2D image can be seen as three major steps [B70]: simplification, feature extraction and decision. The simplification step aims to remove, from the image to be segmented, information that is undesired for the given application or for the specific algorithm employed by the decision step. In the feature extraction step, the simplified image is used for the calculation of pixel features such as intensity and texture; this way, the feature space to be used by the decision step is formed. Finally, in the decision step, the image is segmented into regions by partitioning the feature space so as to create partitions that comply with a given set of criteria.

The first step, which simplifies the image by reducing the amount of information it contains, typically employs well-known image processing techniques such as low-pass, median or morphological filtering. Such techniques can be effectively used for reducing intensity fluctuations in textured parts of the image and for removing pronounced details that fall below a chosen size threshold. Nevertheless, such preprocessing, particularly low-pass filtering, can also affect region boundaries by smoothing them, thus making their accurate detection harder. Recently, new methods have been developed to alleviate this problem; these do not perform simplification before feature extraction; rather, they treat it as part of the feature extraction process, in order to take advantage of already calculated features. This is also demonstrated in [B71], where a moving average filter that alters the intensity features of a pixel is conditionally applied based upon the estimated texture features of that pixel. Additionally, the simplification step can even be seen as an inherent part of the decision step, as in the method of anisotropic diffusion presented in [B62, B72]. Very good results can also be obtained using edge-preserving morphological filtering, as in [B73] and [B74], where a computationally efficient translation-invariant method is developed.

The feature extraction step serves the purpose of calculating the pixel features that are necessary for partitioning the image. Depending on the selected feature space, the process of feature extraction can be as straightforward as simply reading the RGB intensity values of each pixel, or quite complex and computationally intensive in the case of high-dimensional feature spaces. Other than intensity features, which are always used in some form, texture features have also been recently introduced and have been demonstrated to be of importance in still image segmentation [B71, B75]. A wide range of texture feature extraction methods and several strategies for exploiting these features have also been proposed. Contour information can also be used as part of the employed feature space, to facilitate the formation of properly shaped regions. In addition to these features, position features (i.e. the spatial coordinates of each pixel in the image grid) have also proved to be useful for the formation of spatially connected regions. For this reason, in some approaches such as [B71, B75], spatial features have been integrated with intensity and texture features.

Intensity features can be as simple as the RGB intensity values; RGB was the initial choice of color space for the segmentation process. Recently, though, other color spaces, such as CIE Lab and CIE Luv, have proven to be more appropriate than the RGB color space for the application of image segmentation. This is due to them being approximately perceptually uniform (i.e. the numerical distance in these color spaces is approximately proportional to the perceived color difference), which is not the case for the RGB color space. Both the CIE Luv and CIE Lab color spaces have been used for image segmentation in many approaches [B76, B77, B71, B75]. Transformation from RGB to these color spaces can be achieved through the CIE XYZ standard, to which they are related through either a linear (CIE Luv) or a non-linear transformation (CIE Lab).
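In practice, the conversion is available in standard image libraries; the sketch below uses scikit-image (the example image is arbitrary, and rgb2lab performs the RGB-to-XYZ-to-Lab conversion internally).

import numpy as np
from skimage import color

# Convert an RGB image (values in [0, 1]) to the approximately perceptually uniform
# CIE Lab space before computing pixel features.
rgb = np.random.rand(64, 64, 3)
lab = color.rgb2lab(rgb)
l_chan, a_chan, b_chan = lab[..., 0], lab[..., 1], lab[..., 2]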

Texture features are an important addition to intensity features, since they can be used to allow chromatically non-uniform objects to be described by a single region that is uniform in terms of texture. This way, the over-segmentation caused by breaking such objects down into chromatically uniform regions can be avoided. Several strategies have been proposed for extracting texture features; these can be classified into three major categories: statistical, structural and spectral. Statistical techniques characterize texture by the statistical properties of the gray levels of the points comprising a surface. Typically, these properties are computed from the gray level histogram or the gray level co-occurrence matrix of the surface. Most statistical techniques ignore the spatial arrangement of the intensity values in the image lattice; for this reason, their use in segmentation is limited. Structural techniques, on the contrary, characterize texture as being composed of simple primitives called texels (texture elements), which are regularly arranged on a surface according to some rules. These rules are defined by some form of grammar. Structural techniques are often difficult to implement. Spectral techniques are the most recent addition to texture description techniques; they are based on properties of the Fourier spectrum and describe the periodicity of the gray levels of a surface by identifying high-energy peaks in the spectrum. Several spectral techniques have received significant attention in the past few years, including the Discrete Wavelet Frames [B78] and the Discrete Wavelet Packets [B79] decompositions; these can effectively characterize texture at various scales and are used in most recent segmentation algorithms [B77, B71].
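As a rough illustration of multiscale texture features, the sketch below uses an undecimated (stationary) wavelet transform from PyWavelets as a stand-in for the Discrete Wavelet Frames of [B78]; the wavelet, number of levels and use of absolute detail values as features are illustrative choices, and the image side lengths must be divisible by 2**levels.

import numpy as np
import pywt

def texture_features(gray, wavelet="haar", levels=2):
    # Per-pixel texture features from an undecimated wavelet decomposition:
    # the magnitude of each detail band serves as one feature channel.
    coeffs = pywt.swt2(gray.astype(float), wavelet, level=levels)
    bands = []
    for _, (ch, cv, cd) in coeffs:
        for band in (ch, cv, cd):
            bands.append(np.abs(band))
    return np.stack(bands, axis=-1)   # H x W x (3*levels) feature array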

As soon as the feature space has been chosen and the appropriate features have been calculated, as already discussed, a decision step must be employed to appropriately partition the feature space; this decision step is enforced via the application of a segmentation algorithm. Segmentation algorithms for 2D images may be divided primarily into homogeneity-based and boundary-based methods [B70]. Homogeneity-based approaches rely on the homogeneity of spatially localized features such as intensity and texture. Region-growing and split-and-merge techniques also belong to this category. On the other hand, boundary-based methods use primarily gradient information to locate object boundaries. Several other techniques that are difficult to classify into one of these categories have also been proposed.

An important group of homogeneity-based segmentation methods includes split-based and merge-based methods. Given an initial estimation of the partitions, which may be as rough as having all elements gathered in a single partition, or every element associated with a different partition, the actions of splitting a region into a number of sub-regions or merging two regions into one are applied, so as to create partitions that better comply with the chosen homogeneity criteria. Combining these two basic processes, the split&merge technique applies a merging process to the partitions resulting from a split step. Due to its rigid quadtree-structured split and merge process, the conventional split&merge algorithm lacks adaptability to the image semantics, reducing the quality of the result [B80]. This problem was solved in [B81, B82], where edge information is integrated into the split&merge process either by piecewise least-squares approximation of the image intensity functions or via edge-preserving prefiltering.
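As a rough illustration, the sketch below implements only the quadtree split phase of such a scheme, using the block intensity standard deviation as the homogeneity criterion; a merge pass over adjacent leaf regions with similar means would follow in a complete split&merge implementation. Thresholds and names are illustrative.

import numpy as np

def quadtree_split(image, y0, x0, h, w, max_std=10.0, min_size=8, regions=None):
    # Recursive split phase: a block is kept as one region if its intensity standard
    # deviation satisfies the homogeneity criterion, otherwise it is split into quadrants.
    if regions is None:
        regions = []
    block = image[y0:y0 + h, x0:x0 + w]
    if block.std() <= max_std or h <= min_size or w <= min_size:
        regions.append((y0, x0, h, w))
    else:
        h2, w2 = h // 2, w // 2
        quadtree_split(image, y0, x0, h2, w2, max_std, min_size, regions)
        quadtree_split(image, y0, x0 + w2, h2, w - w2, max_std, min_size, regions)
        quadtree_split(image, y0 + h2, x0, h - h2, w2, max_std, min_size, regions)
        quadtree_split(image, y0 + h2, x0 + w2, h - h2, w - w2, max_std, min_size, regions)
    return regions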

A region growing strategy that has lately received significant attention, replacing the rigid split&merge process, is the watershed algorithm, which analyzes an image as a topographic surface, thus creating regions corresponding to detected catchment basins [B83, B84]. If a function f is a continuous height function defined over an image domain, then a catchment basin is defined as the set of points whose paths of steepest descent terminate at the same local minimum of f. For intensity-based image data, the height function f typically represents the gradient magnitude. The watershed algorithm proceeds in two steps. First, an initial classification of all points into regions corresponding to catchment basins is performed, by tracing each point down its path of steepest descent to a local minimum. Then, neighboring regions and the boundaries between them are analyzed according to an appropriate saliency measure, such as minimum boundary height, to allow for merging among adjacent regions. The classical watershed algorithm tends to result in over-segmentation, caused by the presence of an excessive number of local minima in the function f. While several techniques rely on merging adjacent regions according to some criteria [B85, B86] in order to combat over-segmentation, more recent variants of the watershed algorithm alleviate this problem by being modified so as to deal with markers [B87]. Alternatively, the waterfall technique can be used to suppress weak borders and thus reduce over-segmentation [B88, B89].
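A minimal marker-controlled watershed can be sketched with scikit-image as follows; the gradient operator, marker rule and threshold are illustrative, not taken from the cited variants.

import numpy as np
from scipy import ndimage
from skimage import filters, segmentation

def watershed_segmentation(gray, marker_threshold=0.05):
    # The gradient magnitude plays the role of the height function f; markers placed
    # in low-gradient areas limit the over-segmentation of the classical algorithm.
    gradient = filters.sobel(gray.astype(float))
    markers, _ = ndimage.label(gradient < marker_threshold)
    return segmentation.watershed(gradient, markers)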

Another approach to homogeneity-based image segmentation makes use of a K-Means family algorithm to classify pixels to regions. Clustering based on the classical K-Means algorithm is a widely used region segmentation method which, however, tends to produce unconnected regions. This is due to the propensity of the classical K-Means algorithm to ignore spatial information about the intensity values in an image, since it only takes into account the global intensity or color information. To alleviate this problem, the K-Means-with-connectivity-constraint (KMCC) algorithm has been proposed. In this algorithm the spatial features of each pixel are also taken into account by defining a new center for the K-Means algorithm and by integrating the K-Means with a component labeling procedure. The KMCC algorithm has been successfully used for model-based image sequence coding [B90] and content-based watermarking [Boulgouris02] and has been used in conjunction with various feature spaces that combine intensity and texture or motion information.
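The sketch below illustrates the general idea of clustering colour together with weighted spatial coordinates and then enforcing connectivity by component labelling; it is inspired by, but does not implement, the KMCC algorithm described above, and all parameter choices are illustrative.

import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def kmeans_spatial_segmentation(lab_image, k=4, spatial_weight=0.5):
    # K-Means on colour plus (weighted) pixel coordinates, followed by connected-component
    # labelling so that each spatially separate group of pixels becomes a distinct region.
    h, w, _ = lab_image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    coords = np.stack([yy, xx], axis=-1).reshape(-1, 2) * spatial_weight
    feats = np.hstack([lab_image.reshape(-1, 3), coords])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats).reshape(h, w)
    regions = np.zeros((h, w), dtype=int)
    next_label = 1
    for c in range(k):
        comp, n = ndimage.label(labels == c)
        regions[comp > 0] = comp[comp > 0] + next_label - 1
        next_label += n
    return regions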

As far as boundary-based methods are concerned, their objective is to locate the discontinuities in the feature space that correspond to object boundaries. For that, several edge detectors have been proposed, such as the Sobel, Roberts and Canny operators [B91, B93]. The Canny operator is probably the most widely used algorithm for edge detection in image segmentation techniques. The main drawback of such approaches is their lack of robustness; failure to detect a single element of a region contour may lead to undesired merging of regions. This problem can be alleviated with the use of whole boundary methods [B93], which rely on the values of gradients in parts of the image near an object boundary. By considering the boundary as a whole, a global shape measure is imposed, thus gaps are prohibited and overall consistency is emphasized. One of the most popular methods of detecting whole boundaries is the active contour models or snakes approach [B94], where the image gradient information is coupled with constraints and the termination of the boundary curve evolution is controlled by a stopping edge-function. An interesting variation of this framework is proposed in [B95], where the use of a stopping term based on Mumford-Shah segmentation techniques is proposed. This way, the resulting method is capable of detecting contours both with or without gradient, thus being endowed with the capability to detect objects with very smooth boundaries.
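For completeness, a basic edge-detection call is shown below using the Canny operator from scikit-image; the smoothing scale and hysteresis thresholds are illustrative and control the trade-off between missed and spurious edges.

import numpy as np
from skimage import feature

gray = np.random.rand(128, 128)   # placeholder greyscale image
edges = feature.canny(gray, sigma=2.0, low_threshold=0.05, high_threshold=0.15)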

In [B96], a different approach is proposed: a predictive coding scheme is employed to detect the direction of change in various image attributes and construct an edge flow field. By propagating the edge-flow vectors, the boundaries can be detected at image locations that encounter two opposite directions of flow. This approach can be used for locating the boundaries not only between intensity-homogeneous objects but also between texture-homogeneous objects, as opposed to other boundary-based methods, which focus on utilizing intensity information only.


Other important methods that are difficult to classify as either homogeneity-based or boundary-based include segmentation using the Expectation-Maximization (EM) algorithm, segmentation using Markov Chains, segmentation by anisotropic diffusion, and hybrid techniques that combine homogeneity and boundary information.

One of the most widely known segmentation algorithms for content-based image indexing is the Blobworld algorithm [B75], which is based on the Expectation-Maximization (EM) algorithm. The EM algorithm is used for many estimation problems in statistics, to find maximum likelihood parameter estimates when there is missing or incomplete data. For image segmentation, the missing data is the cluster to which the points in the feature space belong. In [B75], the EM algorithm is used for segmentation in the combined intensity, texture and position feature space. For this to be achieved, the joint distribution of color, texture and position features is modeled with a mixture of Gaussians. The EM algorithm is then used to estimate the parameters of this model and the resulting pixel-cluster memberships provide a segmentation of the image.
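
A simplified, Blobworld-style sketch of this idea, assuming scikit-learn's EM-based Gaussian mixture implementation, is given below; unlike [B75] it omits the texture features and uses only colour and position, and the number of mixture components is an arbitrary choice.

# Blobworld-style sketch: fit a Gaussian mixture (via EM) to per-pixel colour+position
# features and use the cluster memberships as a segmentation.
# The feature set is simplified relative to [B75] (no texture features here).
import numpy as np
from sklearn.mixture import GaussianMixture

def em_segment(image, n_components=5):
    """image: HxWx3 float array; returns an HxW map of mixture-component labels."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([image.reshape(-1, 3),
                             ys.reshape(-1, 1) / h,
                             xs.reshape(-1, 1) / w])
    gmm = GaussianMixture(n_components=n_components, covariance_type='full').fit(feats)
    return gmm.predict(feats).reshape(h, w)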

Using Markov Chains provides another interesting means to perform image segmentation [B97, B98]. Very promising results are presented in [B99], where the Data-Driven Markov Chain Monte Carlo (DDMCMC) paradigm is developed; ergodic Markov Chains are designed to explore the solution space, and data-driven methods, such as edge detection and data clustering, are used to guide the Markov Chain search. The data-driven approach results in a significant speed-up in comparison to previous Markov Chain Monte Carlo algorithms. Additionally, the DDMCMC paradigm provides a unifying framework in which the role of many other segmentation approaches, such as edge detection, clustering and split-and-merge, can be explored.

Another segmentation method for 2D images is segmentation by anisotropic diffusion [B72]. Anisotropic diffusion can be seen as a robust procedure that estimates a piecewise smooth image from a noisy input image. The edge-stopping function in the anisotropic diffusion equation allows the preservation of edges while diffusing the rest of the image. In this way, noise and irrelevant image details can be filtered out, making it easier for a segmentation algorithm to achieve spatial compactness while retaining the edge information. In [B100] the problem of color image segmentation is addressed by applying two independent anisotropic diffusion processes, one to the luminance and one to the chrominance information, and subsequently combining the segmentation results.
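
The following is a minimal Perona-Malik-style diffusion sketch on a grey-level image, illustrating the role of the edge-stopping function; the conductance scale kappa, the step size and the number of iterations are illustrative values, not those of [B72] or [B100].

# Minimal Perona-Malik-style anisotropic diffusion on a grey-level image.
import numpy as np

def anisotropic_diffusion(img, n_iter=20, kappa=30.0, step=0.2):
    """Smooth within regions while preserving strong edges."""
    u = img.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)   # edge-stopping (conductance) function
    for _ in range(n_iter):
        # Finite differences towards the four neighbours (borders wrap; acceptable for a sketch).
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        u += step * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u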

Hybrid techniques that integrate the results of boundary-based and homogeneity-based approaches have also been proposed, to combine the advantages of both approaches and gain in accuracy and robustness. In [B101], an algorithm named region competition is presented. This algorithm is derived by minimizing a generalized Bayes/minimum description length (MDL) criterion using the variational principle, and combines aspects of snakes/balloons and region growing; the classic snakes and region growing algorithms can be directly derived from this approach. In [B102], the snakes approach is coupled with the watershed approach, which is used to restrict the number of edge curves that are to be considered by the snake algorithm, by eliminating unnecessary curves while preserving the important ones; this results in increased time-efficiency of the segmentation process. Very promising results are presented in [B103], where the color edges of the image are first obtained by an isotropic color-edge detector and the centroids between the adjacent edge regions are then taken as the initial seeds for region growing. Additionally, the results of color-edge extraction and seeded region growing are integrated to provide a more accurate segmentation.

5.2 Low-Level Visual Features

Each low-level visual feature refers to a single category of visual properties: colour, texture, shape or motion. Visual features are the basic cues our visual system uses. In [A89], colour is presented as a powerful cue which had until recently been unjustly neglected in favour of geometrical cues in object recognition applications. The current widespread use of colour in content-based retrieval and recognition applications, however, demonstrates a radical evolution.

The use of colour in image displays is not only more pleasing, but it also enables us to receive more visual information. While we can perceive only a few dozen grey levels, we have the ability to distinguish between thousands of colours. Colour representation is based on the classical theory of Thomas Young (1802), further developed by scientists from Maxwell (1861) to more recent ones such as MacAdam (1970), Wyszecki and Stiles (1967), and many more. The colour of an object depends not only on the object itself, but also on the light source illuminating it, on the colour of the surrounding area, and on the human visual system. Light is the electromagnetic radiation that stimulates our visual response. It is expressed as a spectral energy distribution L(λ), where λ is the wavelength, which for visible light lies in the region 350 nm to 780 nm of the electromagnetic spectrum. Achromatic light is what we see on a black-and-white television set or display monitor. An observer of achromatic light normally experiences none of the sensations we associate with red, blue, yellow, and so on. Quantity of light is the only attribute of achromatic light; it can be expressed by intensity and luminance in the physical sense of energy, or by brightness in the psychological sense of perceived intensity. The visual sensations caused by coloured light are much richer than those caused by achromatic light. The perceptual attributes of colour are brightness, hue, and saturation. Brightness represents the perceived luminance. The hue of a colour refers to its "redness", "greenness", and so on. Saturation is the aspect of perception that varies most strongly as more and more white light is added to a monochromatic light (ibid.). The human visual system includes the eye, the optic nerve, and parts of the brain. This system is highly adaptive and non-uniform in many respects, and by recognising and compensating for these non-uniformities we can produce improved displays for many images.

Colour is in fact a visual sensation produced by light in the visible region of the spectrum incident on the retina [A96]. It may be defined in many different spaces; the human visual system has three types of colour photoreceptor cells, so colour spaces need only be three-dimensional for an adequate colour description. The RGB (Red Green Blue) space is the basic space in which the pixels of coloured digital images are usually defined. However, other spaces have also been exploited: L*u*v*, HVC, the Munsell colour space, the Itten-Runge sphere, etc. A good interpretation of some of these colour spaces can be found in [A96]. Components in these spaces are derived from the RGB components with the help of appropriate transforms. Each space has different properties, and is thus advocated when it best suits the application. Some spaces have been shown to agree more closely with human perception of colour.
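
As a small illustration, assuming scikit-image is available, an RGB image can be transformed into some of the spaces mentioned above before any indexing takes place; the input file name is a placeholder.

# Converting an RGB image to other colour spaces before indexing
# (scikit-image conversions; HSV and L*u*v* are two of the spaces mentioned above).
from skimage import io, color

img_rgb = io.imread('example.jpg') / 255.0   # placeholder input image
img_hsv = color.rgb2hsv(img_rgb)             # hue, saturation, value
img_luv = color.rgb2luv(img_rgb)             # CIE L*u*v*, closer to perceptual uniformity
img_lab = color.rgb2lab(img_rgb)             # CIE L*a*b*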

Texture is defined by the repetition of a basic pattern over a given area. This basic pattern, referred to as ‘texel’, contains several pixels whose placement could be periodic, quasi-periodic or random [A37]. Texture is a material property, along with colour. Colour and texture are thus commonly associated for region and object indexing, and recognition. As a material property, texture is virtually everywhere. Picard identifies three properties of texture: lack of specific complexity, presence of high frequencies, and restricted scale [A63]. She even extends the definition of texture to a characteristic of sound and motion as well as a visual characteristic. Thus, texture can be defined in time: in her PhD thesis [A44], Liu defines temporal texture as ‘motion patterns of indeterminate spatial and temporal extent’. Examples of temporal textures are periodic temporal activities: walking, wheels of a car rolling on the
road... In artificial or human-made environments, visual texture tends to be periodic and deterministic. In natural scenes, textures are generally random (a sandy beach for instance) and non-deterministic. The 'ubiquity of texture' [A44] means that a wide range of approaches, models and features have been defined and used to represent texture, especially in content-based retrieval applications. A review of some of them is also given in [A63]. Most of these representations try to approximate or agree with human perception of texture as much as possible.

The shape of an object refers to its profile or form [A31, A37]. Shape is a particularly important feature when characterising objects. It is also essential in certain domains where the colour and texture of different objects can appear similar, e.g. medical images [A31].

Finally, motion can be a very powerful cue for recognising and indexing objects, although it can only be used for video applications [B44], [B53], [B59].

5.3 Colour Descriptors

The colour feature is one of the most widely used visual features in Image Retrieval. Several methods for retrieving images on the basis of colour similarity have been described in the literature. Some representative studies of colour perception and colour spaces can be found in [C1, C2, C3]. The colour histogram is often used in image retrieval systems due to its good performance in characterizing the global colour content. Statistically, it denotes the joint probability of the intensities of the three colour channels. The matching technique most commonly used, Histogram Intersection, was first developed by Swain and Ballard [C4]. This method proposes an L1 metric as the similarity measure for the Colour Histogram. In order to take into account the similarities between similar but not identical colours, Ioka [C5] and Niblack et al. [C6] introduced an L2-related metric for comparing the histograms. Furthermore, Stricker and Orengo proposed the use of the cumulative Colour Histogram, in an attempt to make the aforementioned methods more robust [C7]. The index contained the complete colour distributions of the images in the form of cumulative colour histograms. The colour distributions were compared using the L1, the L2, or the L∞ metric. Besides the Colour Histogram, several other colour feature representations have been applied in Image Retrieval, including Colour Moments and Colour Sets. Instead of storing the complete colour distributions, Stricker and Orengo proposed the Colour Moments approach [C7], in which the index contains only the dominant features of the distributions. This approach was implemented by storing the first three moments of each colour channel of an image. The similarity function used for retrieval was a weighted sum of the absolute differences between corresponding moments. Smith and Chang proposed Colour Sets to facilitate the search of large-scale image and video databases [C8, C9]. This approach identified the regions within images that contain colours from predetermined colour sets. The (R, G, B) colour space was first transformed into a perceptually uniform space, such as HSV, and then quantized into M bins. Colour Sets correspond to salient image regions, and are represented by binary vectors to allow a more rapid search.
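
A compact sketch of two of the representations above, the global colour histogram compared by histogram intersection (in the spirit of Swain and Ballard [C4]) and the first three moments of each colour channel (in the spirit of Stricker and Orengo [C7]), is given below; the number of bins per channel is an illustrative choice.

# Global colour histogram with histogram intersection, and first-three-moments
# colour indexing; 8 bins per channel is an illustrative choice.
import numpy as np

def colour_histogram(image, bins=8):
    """image: HxWx3 array with values in [0,1]; returns a normalised 3D histogram."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3, range=[(0, 1)] * 3)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0,1]; 1 means identical colour distributions."""
    return np.minimum(h1, h2).sum()

def colour_moments(image):
    """First three moments (mean, standard deviation, skew) of each colour channel."""
    pixels = image.reshape(-1, 3)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])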

Methods of improving on Swain and Ballard's original technique also include the use of region-based colour querying [C51, C61]. In these cases, the colour is described by a histogram of L bins of the colour coordinates in any colour space. In order to help the user formulate effective queries and understand their results, as well as to minimize disappointment due to overly optimistic expectations of the system, systems based on this method [C51, C61] display the segmented representation of the submitted image and allow the user to specify which aspects of that representation are relevant to the query. When the desired region has been selected, the user is then allowed to adjust the weights of each feature of the selected region.

The main weakness of all the indexing methods described above is the lack of spatial information in the indices. Many research results suggest that using both colour features and spatial relations is a better solution. The simplest way to provide spatial information is to divide the image into sub-images, and then index each of these [C46, C47]. A variation of this approach is the quad-tree based colour layout approach [C48], where the entire image was split into a quad-tree structure and each tree branch had its own histogram to describe its colour content. This regular sub-block based approach cannot provide accurate local colour information and is expensive in computation and storage. A more sophisticated approach is to segment the image into regions with salient colour features by Colour Set back-projection, and then store the position and Colour Set feature of each region to support later queries [C8]. The advantage of this approach is its accuracy, while the disadvantage is the generally difficult problem of reliable image segmentation. In [C50], Stricker and Dimai split the image into an oval central region and four corners. They extracted the first three colour moments from these regions, attributing more weight to the central region. The use of overlapping regions made their approach relatively insensitive to small transformations of regions. Spatial Chromatic Histograms [C49] combine information about the location of pixels of similar colour and their arrangement within the image with that provided by the classical colour histogram. Mitra et al. [C57] proposed colour correlograms as colour features, which include the spatial correlation of colours and can be used to describe the global distribution of the local correlations.

In general a 3D color histogram is used to represent the color distribution of an image. Both the color space adopted and the number of bins of the histogram used to describe the color distribution may influence the recognition rate, but it is the matching strategy that most distinguishes the different methods. Stricker [D34] has shown that using the L1 norm for evaluating histogram similarity may produce false negatives (i.e. not all the images similar to the query are retrieved), while applying the L2 norm may result, instead, in false positives (i.e. images not similar to the query are retrieved) [D37]. Hafner et al. [D17] have proposed an L2-related metric that results in a smaller set of false negatives. We have addressed color image indexing [D3, D4] using perceptual correlates of the psychological dimensions of Lightness, Chroma, and Hue. Extending this work to deal with unsegmented pictorial images [D2], we found experimentally that observers disagreed in evaluating color similarity, and that the set of similar images found by browsing the original images was far from coinciding with that obtained by browsing the randomized version of the database (where the original image structure was changed, but not the color distribution). This indicates that for some images observers are unable to assess color information independently of other perceptual features, such as shape and texture. Stricker [D35] has proposed the use of boundary histograms, which encode the lengths of the boundaries between different discrete colors, in order to take geometric information into account in color image indexing. However, this boundary histogram method may yield a huge feature space (for a discrete color space of 256 elements, the dimension of the boundary histogram is 32,768) and is not robust enough to deal with textured color images. Gagliardi and Schettini [D14] have investigated the use and integration of different color information descriptions and similarity measurements to improve system effectiveness. In their method both query and database images are described in the CIELAB color space [D43], with two limited palettes of perceptual significance, of 256 and 13 colors respectively. A histogram of the finer color quantization and another of the boundary lengths between two discrete colors of the coarser quantization are used as indices of the image. While the former contains no spatial information, but describes only the color content of the image, the latter provides a concise description of the spatial arrangement of the basic colors in the image. Suitable procedures for measuring the similarity
between histograms are then adopted and combined in order to model the perceptual similarity between the query and target images.

Stricker has proposed two other approaches more efficient than those based on color histograms [D6, D35]: in the first, instead of computing and storing the complete 3D color histogram, only the first three moments of the histograms of each color channel are computed and used as an index; in the second, an image is represented only by the average and covariance matrix of its color distribution. The similarity functions used in these approaches for retrieval are a weighted sum of the absolute differences between the features computed. However, these methods too fail to take into account the spatial relationships among color pixels; consequently, images with quite a different appearance may be judged similar simply because they have a similar color composition [D2].

5.4 Texture

The ability to retrieve images on the basis of texture similarity may not seem very useful in itself, but matching on texture similarity can often be useful in distinguishing between areas of images with similar colour (such as sky and sea, or leaves and grass) [C10]. Texture contains important information about the structural arrangement of surfaces and their relationship to the surrounding environment [C11]. A variety of techniques have been used for measuring texture similarity. In the early 1970s, Haralick et al. proposed the co-occurrence matrix representation of texture features [C11]. In this approach the features are based on the co-occurrence matrix, a two-dimensional histogram of the spatial dependencies of neighbouring grey values. More specifically, the co-occurrence matrix is the feature primitive for the co-occurrence texture features, most of which are moments, correlations, and entropies. Many other researchers followed this approach and proposed enhanced versions. For example, Gotlieb and Kreyszig studied the statistics originally proposed in [C11] and found experimentally that contrast, inverse difference moment and entropy had the greatest discriminatory power [C12]. Tamura et al. explored texture representation from a different angle [C13]. They calculated computational approximations of coarseness, contrast, directionality, linelikeness, regularity, and roughness, which were found to be important visual texture properties in psychological studies. One major difference between the Tamura texture representation and the co-occurrence matrix representation is that the texture properties in the Tamura representation are visually meaningful while some of the texture properties used in the co-occurrence matrix representation may not be (for example, entropy). This texture representation was further improved by the QBIC system [C14] and the MARS system [C15, C16].
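
As an illustration of the co-occurrence approach, the sketch below computes a grey-level co-occurrence matrix and a few Haralick-style statistics with scikit-image; the chosen distances, angles and properties are illustrative, and the functions are named greycomatrix/greycoprops in older scikit-image releases.

# Co-occurrence (Haralick-style) texture features with scikit-image.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_image):
    """gray_image: 2D uint8 array; returns a small vector of co-occurrence statistics."""
    glcm = graycomatrix(gray_image, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return np.hstack([graycoprops(glcm, p).ravel()
                      for p in ('contrast', 'homogeneity', 'energy', 'correlation')])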

Alternative methods of texture analysis for retrieval include the use of the Wavelet transform in texture representation [C17, C18, C19, C20, C21, C22]. In [C17, C10], Smith and Chang used the mean and variance extracted from the Wavelet subbands as the texture representation. A tree-structured Wavelet transform was used by Chang and Kuo in [C18] to explore the middle-band characteristics. Researchers have also combined the Wavelet transform with other techniques to achieve better performance. Gross et al. used the Wavelet transform together with KL expansion and Kohonen maps [C54], while Thyagarajan et al. [C22] and Kundu et al. [C21] combined the Wavelet transform with the co-occurrence matrix. According to the review by Weszka et al., the Fourier power spectrum performed poorly, while the second-order grey level statistics (co-occurrence matrix) and first-order statistics of grey level differences were comparable [C23]. In [C24], Ohanian and Dubes compared the following types of texture representations: the Markov Random Field representation [C25], the multi-channel filtering representation, the fractal-based representation [C26], and the co-occurrence representation, and found that the co-occurrence matrix representation performed best on their test sets. In a more
recent paper [C27], Ma and Manjunath investigated the performance of different types of Wavelet-transform based texture features. In particular, they considered orthogonal and biorthogonal Wavelet transforms, the tree-structured Wavelet transform, and the Gabor wavelet transform. In all their experiments the best performance was achieved using the Gabor transform, which matched the results of human vision studies [C10]. Furthermore, texture queries can be formulated in a similar manner to colour queries, by selecting examples of desired textures from a palette, or by supplying an example query image. The system then retrieves images with texture measures most similar in value to the query. A recent extension of the technique is the texture thesaurus developed by Ma and Manjunath [C62], which retrieves textured regions in images on the basis of similarity to automatically derived codewords representing important classes of texture within the collection.
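
A small Gabor filter bank in the spirit of this line of work can be sketched as follows, assuming scikit-image; the mean and standard deviation of each filter response magnitude serve as texture features, and the chosen frequencies and orientations are illustrative.

# Small Gabor filter bank: per-filter response statistics used as texture features.
import numpy as np
from skimage.filters import gabor

def gabor_features(gray_image, frequencies=(0.1, 0.2, 0.4), n_orientations=4):
    feats = []
    for f in frequencies:
        for k in range(n_orientations):
            real, imag = gabor(gray_image, frequency=f, theta=k * np.pi / n_orientations)
            mag = np.hypot(real, imag)          # response magnitude
            feats.extend([mag.mean(), mag.std()])
    return np.array(feats)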

Most of the computational methods available for describing texture provide for the supervised or unsupervised classification of image regions and pixels. Within these contexts gray level textures have been processed using various approaches, such as the Fourier transform, co-occurrence statistics, directional filter masks, fractal dimension and Markov random fields (for a review of the various methods, see [D8, D42]). Rao and Lohse have designed an experiment to identify the high-level features of texture perception [D27, D28]. Their results suggest that in human vision three perceptual features ("repetitiveness", "directionality", and "granularity and complexity") concur to describe texture appearance. Consequently, the computational model applied in image indexing should compute features that reflect these perceptual ones. To do so, the IBM QBIC system uses a modified version of the features "coarseness", "contrast" and "directionality" proposed by Tamura for image indexing [D38, D9]. Amadasun and King have proposed another feature set that corresponds to the visual properties of texture: "coarseness", "contrast", "busyness", "complexity", and "texture strength" [D1]. Picard and Liu, extending the work described in [D12, D13], have proposed an indexing scheme based on a Wold decomposition of the luminance field [D20, D26] in terms of "periodicity", "directionality", and "randomness". Although they make no explicit reference to human perception, Manjunath and Ma [D22], Gimel'Farb and Jain [D16] and Smith and Chang [D31] have also made significant contributions to texture feature extraction and similarity search in large image databases.

Color images must be converted to luminance images before these texture features are computed [D15, D43]. While the sharpness of an image does depend much more on its luminance than on its chrominance, some textures, such as marble and granites, require that color information be discriminated [D33].

Considering texture as the visual effect produced by the spatial variation of pixel colors over an image region, Schettini has defined a small color-texture feature set for the unsupervised classification and segmentation of complex color-texture images [D30]. The key idea of the indexing method is to use the difference in orientation between two color vectors in an orthonormal color space as their color difference measure. For each pixel of the image, the angular difference between its own color vector and the average color vector evaluated in the surrounding neighborhood is computed to produce a gray-level "color contrast image". A set of texture features is then computed from the low-order spatial moments of the area around each pixel of the color contrast image. The texture features are used together with the average color (making a total of nine features) to index the image.

5.5 Shape

Colour and texture characterise the material properties of objects and regions. They represent the ‘stuff’ as opposed to ‘things’ [A24]. Ultimately shape descriptors are needed to represent objects and obtain a more semantic representation of an image. One can distinguish between
global descriptors, which are derived from the entire shape, and local descriptors, which are derived by partial processing of the shape and do not depend on the entire shape [A31].

Simple global descriptors include the area of the region/object, its centroid, its circularity, and its moments [A12, A27, A16, A59, A26, A42]. Eakins et al. [A23] extend this list with length irregularity, discontinuity angle irregularity, complexity, aspect ratio, right-angleness, sharpness, and directedness. For their system ARTISAN, they introduce a novel approach to shape analysis based on studies of the human perception of shape. They argue that 'image retrieval should be based on what the eye actually sees, rather than the image itself'. They therefore propose that object boundaries should be grouped into 'boundary families', according to criteria such as collinearity, proximity and pattern repetition. Chang et al. [A16] introduce two more global features for their system VideoQ: the normalised area and the percentage area. The normalised area is the ratio of the area of an object to the area of its circumscribing circle; the percentage area is the percentage of the area of the video frame that is occupied by the object.
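
A few of these simple global descriptors can be computed directly from a binary object mask, for instance with scikit-image region properties, as in the hedged sketch below; the circularity formula used (4πA/P²) is one common convention.

# A few simple global shape descriptors computed from a binary object mask.
import numpy as np
from skimage.measure import label, regionprops

def global_shape_descriptors(mask):
    """mask: 2D boolean array for one object; returns area, centroid and circularity."""
    props = regionprops(label(mask))[0]        # assumes a single connected object
    area = props.area
    centroid = props.centroid                  # (row, col)
    circularity = 4 * np.pi * area / (props.perimeter ** 2)  # 1.0 for a perfect disc
    return area, centroid, circularity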

The use of curvature to derive shape descriptors has been explored in [A46] (curvature functions) and in [A57] (curvature scale space representation). Pentland et al., Saber and Murat-Tekalp [A71], and Sclaroff et al. [A79] have all proposed shape representations based on eigenvalue analysis. The method proposed by Saber and Murat-Tekalp was adopted by Chang et al. for the system VideoQ [A16].

Although the shape of an object or a region may be indexed accurately, it is also often approximated by a simpler shape (e.g. minimum bounding rectangle, ellipse), and simple global descriptors are calculated for these representative shapes. This makes queries by sketch easier for the user [A16], and indexing simpler. For the QBIC system [A59], a specific methodology has been developed for queries by sketch: a reduced resolution edge map is computed and stored for each image. These maps are compared to the map derived from the user's sketch.

The ability to retrieve by shape is perhaps the most obvious requirement at the primitive level. Unlike texture, shape is a fairly well-defined concept, and there is considerable evidence that natural objects are primarily recognized by their shape. In general, shape representations can be divided into two categories, boundary-based and region-based. The former uses only the outer boundary of the shape while the latter uses the entire shape region [C28]. Among boundary-based representations, the Fourier Descriptor, which uses the Fourier-transformed boundary as the shape feature, performs best. Some early work can be found in [C29, C30]. The modified Fourier Descriptor, which was proposed by Rui et al., is both robust to noise and invariant to geometric transformations [C28]. In the area of region-based representations, the Moment Invariants, which use region-based moments that are invariant to transformations as the shape feature, are the most successful representative. In [C31], Hu identified seven such moments. Based on his work, many improved versions emerged [C32]. Kapur et al. developed a method to systematically generate and search for a given geometry's invariants [C33], and Gross and Latecki developed an approach which preserves the qualitative differential geometry of the object boundary even after an image is digitised [C33]. In [C34, C35], algebraic curves and invariants are used to represent complex objects in cluttered scenes by parts or patches. Alternative methods proposed for shape matching include elastic deformation of templates (Finite Element Method, FEM) [C36], a Turning Function based method for comparing both convex and concave polygons [C37], the Wavelet Descriptor, which embraces desirable properties such as multi-resolution representation, invariance, uniqueness, stability, and spatial localization [C38], comparison of directional histograms of edges extracted from the image [C63], and shocks, skeletal representations of object shape that can be compared using graph matching techniques [C56]. The Chamfer
matching technique, first proposed by Barrow et al., matched one template against an image, allowing certain geometrical transformations (e.g. translation, rotation, affine) [C39]. A number of extensions have been proposed to the basic Chamfer matching scheme. Some deal with hierarchical approaches to improve match efficiency and use multiple image resolutions [C40]. In [C41], Li and Ma proved that the Geometric Moments method (region-based) and the Fourier Descriptor (boundary-based) are related by a simple linear transformation. In [C42], Babu et al. showed that the combined representations outperformed the simple representations. In [C51] shape is represented by (approximate) area, eccentricity, and orientation.

Shape matching of three-dimensional objects is a more challenging task – particularly where only a single 2-D view of the object in question is available. Examples of methods for 3D shape representation include: Fourier descriptors [C43], use of a hybrid structural/statistical local shape analysis algorithm [C44], or use of a set of Algebraic Moment Invariants [C45] (this was used to represent both 2D and 3D shapes).

5.6 Motion

A global motion index can be defined by examining the overall motion activity in a frame [A20, A107, A21, A95, A28]. The estimation of the activity can be based on the observed optic flow motion as in [A107]; Vasconcelos and Lippman [A95] rely on the tangent distance.

Alternatively, we can distinguish between two types of motion: the motion induced by the camera, and the motion of the objects present in the scene. During the recording of a sequence a camera can pan, tilt, or zoom. Panning refers to a horizontal rotation of the camera around its vertical axis. When tilting, the camera rotates around its horizontal axis. During a zoom, the camera varies its focal length. Detecting the camera operations is important as it helps determine the objects' absolute motion. Furthermore, the type of camera motion is a hint for semantic analysis. For instance in a basketball match, the camera pans the court and follows the ball when the teams are moving from one end to the other; but when a point is about to be scored, all the players are located around one basket, so the camera is static [A76]. Also, film directors choose particular types of camera motion to convey particular impressions. It is possible to find techniques which detect both object and camera motions. These techniques rely on the estimation and analysis of the optic flow.
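
A rough sketch of optic-flow based pan detection, assuming OpenCV's dense (Farneback) optical flow, is given below; it simply checks for a dominant, coherent horizontal motion between two frames, and the thresholds are illustrative rather than values taken from the cited systems.

# Rough camera-pan detector: dense optical flow between two greyscale (uint8) frames,
# then a check for a dominant, coherent horizontal motion. Thresholds are illustrative.
import cv2
import numpy as np

def looks_like_pan(prev_gray, next_gray, mag_thresh=1.0, coherence=0.6):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving = mag > mag_thresh
    if moving.mean() < 0.5:                      # most of the frame should move in a pan
        return False
    # Fraction of moving pixels whose direction is close to horizontal.
    horiz = (np.abs(np.cos(ang[moving])) > 0.9).mean()
    return horiz > coherence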

For instance, in the system VideoQ [A16], a hierarchical pixel-domain motion estimation method is used to extract the optic flow. The global motion components of objects in the scene are compensated by the affine model of the global motion. Panning is detected by determining dominant motions along particular directions from a global motion velocity histogram. A technique to detect zooming is also reported. Objects are segmented and tracked by fusing edge, colour and motion information: here, the optic flow is used to project and track regions (segmented with colour and edge cues) through the video sequence. For each frame and each tracked object, a vector represents the average translation of the centroid of the object between successive frames after global motion compensation. The speed of the object and the duration of motion can be determined by storing the frame rate of the video sequence. It is then interesting to note how queries based on motion properties are proposed in this system. The user can sketch one or several objects (represented by simple geometrical forms), their motion trail, and specify the duration of the motion, the attributes of the objects, the order in which they appear. To see whether a stored object’s motion matches the specified trail, its trail is uniformly sampled based on the frame rate; then its trail is either projected onto the x-y space (if the user has no clear idea about the motion duration), or left in the spatio temporal domain. The first scheme reduces the comparison of the trails to a contour matching scheme. With the second scheme, the Euclidean distances between the trail samples
of the stored items and the query are calculated and summed to give an overall matching score.
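
A minimal sketch of this kind of trail matching, assuming trails are stored as sequences of centroid positions, is shown below: both trails are resampled to the same number of points and the Euclidean distances between corresponding samples are summed.

# Spatio-temporal trail matching sketch: resample both trails to the same number of
# samples and sum the Euclidean distances between corresponding points.
import numpy as np

def resample_trail(trail, n):
    """trail: (m, 2) array of centroid positions; returns (n, 2) uniformly resampled."""
    t_old = np.linspace(0.0, 1.0, len(trail))
    t_new = np.linspace(0.0, 1.0, n)
    return np.column_stack([np.interp(t_new, t_old, trail[:, d]) for d in range(2)])

def trail_distance(query_trail, stored_trail, n=32):
    q = resample_trail(np.asarray(query_trail, float), n)
    s = resample_trail(np.asarray(stored_trail, float), n)
    return np.linalg.norm(q - s, axis=1).sum()   # lower score = better match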

5.7 Retrieval by other types of primitives

One of the oldest-established methods of accessing pictorial data is retrieval by its position within an image. Accessing data by spatial location is an essential aspect of geographical information systems, and efficient methods to achieve this have been under development for many years [C52, C53]. Similar techniques have been applied to image collections where the search of images containing objects in defined spatial relationships with each other was possible [C54, C55].

Several other types of image features have been proposed as a basis for CBIR. Most of these techniques aim to extract features, which reflect some aspect of image similarity based on human perception. The most well researched technique of this kind uses the wavelet transform to model an image at several different resolutions. Promising retrieval results have been reported by matching wavelet features computed from query and stored images [C58]. Another method giving interesting results is retrieval by appearance. Two versions of this method have been developed, one for whole-image matching and one for matching selected parts of an image. The part-image technique involves filtering the image with Gaussian derivatives at multiple scales and then computing differential invariants; the whole-image technique uses distributions of local curvature and phase [C59].

The advantage of all these techniques is that they can describe an image at varying levels of detail (useful in natural scenes where the objects of interest may appear in a variety of guises), and avoid the need to segment the image into regions of interest before shape descriptors can be computed. Despite recent advances in techniques for image segmentation [C51, C60], this remains a troublesome problem.

6. High-Level Descriptors

Although existing systems can retrieve images or video segments based on the sole specification of colour, texture or shape, these low-level descriptors are not sufficient to describe the rich content of images, and they restrict the field of possible queries. Retrieval results remain very approximate in some cases. Schemes which capture high-level and semantic properties from low-level properties and domain knowledge have therefore been developed.

In general, modelling the semantic content is more difficult than modelling low-level visual descriptors. For the machine, video is just a temporal sequence of pixel regions without a direct relation to its semantic content. This means that some sort of human interaction is needed for semantic annotation. Probably the simplest way to model the video content is by using free-text manual annotation. Some approaches [C64, C65] introduce additional video entities, such as objects and events, as well as their relations, that should be annotated, because they are subjects of interest in video. Humans think in terms of events and remember different events and objects after watching a video; these high-level concepts are therefore the most important cues in content-based video retrieval. A few attempts to include these high-level concepts in a video model are made in [C66, C67].

As segmentation techniques progress, it becomes possible to identify meaningful regions and objects. A further step is to identify what these regions correspond to. This is possible using low-level features and grouping or classification techniques. The approach of [A47] uses learning strategies to group pixels into objects and classify these objects as one of several predefined types. In [A10], an optimised-learning-rate LVQ algorithm is chosen to classify feature vectors associated with single pixels. Mo et al. utilize state transition models, which include both top-down and bottom-up processes, to recognise different objects in sports scenes [A56]. These objects will have first been segmented and characterized by low-level features. In [A42], the background of images containing people is decomposed into different classes, by comparing the available perceptual and spatial information with look-up tables.

People and face detection are an important step in the semantic analysis of images and videos. In the 1995 NSF-ARPA Workshop on Visual Information Management Systems [A38], a focus on human action was felt to be one of the most important topics to address. Since estimates showed that in over 95% of all video, the primary camera subject is a human or a group of humans, this focus is justified. Already a face detector has been included in the WebSeer WWW image retrieval engine [A90] and in the video skim generator presented by Smith and Kanade [A84]: both systems use the face detector presented by Rowley [A68]. The detector is reported to be over 86% accurate for a test set of 507 images; it can deal with faces of varying sizes, is particularly reliable with frontal faces and is thus appropriate for ‘talking-head’ images or sequences.

In [A42], studies have shown that the normalised human flesh-tone is reasonably uniform across race and tan: therefore person extraction is performed by detecting pixels with flesh-tone. The results of the flesh-detection are associated with the results of an edge analysis scheme, and simple models of the head and body shapes are employed to segment the person from the background. Malik et al. [A47] also group skin-like pixels into limbs and use a simplified kinematic model to connect the limbs.

A user may not only be interested in particular types of objects or regions, but also in their location in the image. So once regions and objects of interest have been identified, their absolute or relative positions must be determined. A simple approach for the absolute location is to determine the position of the centroid of the regions or objects (or of geometric forms approximating them). In [A46], the coordinates of the Minimum Bounding Rectangle are also
used, while in [A82] the evaluation of spatial locations in a query is accomplished by referring to a quad-tree spatial index. Spatial relationships between objects can be specified simply by the relative distance between their centroids. 2-D strings and their successors represent a more sophisticated way to formulate and index spatial relationships. 2-D strings (introduced by Chang et al. [A15]) require the positions of the centroids of segmented objects to be known.

However, 2-D strings are point-based, as only the centroid of an object is considered. They do not take the extent of objects into account, so some relationships (e.g. overlap) are difficult to express. As a result a number of other string representations have been proposed and are reviewed in [A31]. The 2-D B string in particular represents each object by the start point and the end point of its projection on the horizontal and vertical axes. A set of operators is then used to describe the ordering of these points along each direction. To compare strings, two ranks are assigned to each object, one for the start point and one for the end point, and these ranks are compared during retrieval. The use of 2-D strings was first applied to images but later suggested for video [A5], by associating each frame with a string. The resulting sequence of strings is reformulated such that the first string in the sequence remains in the standard notation, while the subsequent strings are written in set edit notation. A similar technique using 2-D B strings has been proposed by Shearer et al. [A81]; they define a set edit notation which encodes the initial configuration of objects and a description of their spatial changes over time. A user can thus search for a sub-sequence of frames where objects are in particular configurations.
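
As an illustration of the basic 2-D string idea, the sketch below orders object centroids along each axis and writes the result with '<' and '=' operators; it covers only the original point-based 2-D strings of Chang et al., not the 2-D B string extensions.

# Minimal 2-D string sketch: objects are ordered by the x and y coordinates of their
# centroids and written as strings with '<' (strictly before) and '=' (same position).
def two_d_string(objects):
    """objects: list of (label, (x, y)) tuples with string labels; returns (u, v) strings."""
    def project(axis):
        ordered = sorted(objects, key=lambda o: o[1][axis])
        s = ordered[0][0]
        for prev, cur in zip(ordered, ordered[1:]):
            s += ('=' if cur[1][axis] == prev[1][axis] else '<') + cur[0]
        return s
    return project(0), project(1)

# Example: two_d_string([('a', (1, 3)), ('b', (2, 1)), ('c', (2, 5))])
# -> ('a<b=c', 'b<a<c')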

Low-level features can even be related to concepts. In a set of studies, Rao et al. [A65] found that different human subjects perform similar classifications of texture images; the subjects also perform similar classifications of texture words. Rao et al. have thereby derived a set of categories for texture images and another one for texture words.

Hachimura [A27] relates combinations of principal colours to words describing perceptive impressions. In the context of films, Vasconcelos and Lippman [A95] showed that the local activity of a sequence and its duration could yield a feature space in which films can be classified according to their type: action or non-action (romance, comedy). Furthermore, it seemed possible to estimate the violence content of the films with these two features. The system Webseek [A83] performs Fisher discriminant analysis on samples of colour histograms of images and videos to automatically assign the images and videos to type classes. This approach was in fact reported successful in meeting its goal. As for the WWW retrieval system WebSeer, it can distinguish photographic images from graphic images [A90]. The classification into graphics or photographs is performed using trained multiple decision trees with tests on the colour content, size and shape of the images [A6].

7. Defining Metrics between Descriptors and Relevance Feedback

The objective of most content-based systems is not necessarily retrieval by matching, but retrieval by similarity. There might not be one single item among all those present in the database which can be characterised by the user's specification. Each user may also be interested in different parts of a single image, and it is currently not possible to index all the details present in an image. Furthermore, the user might not be confident in his/her own specifications. Criteria based on similarity therefore make retrieval systems more flexible with respect to all these aspects.

In [A74], Santini and Jain take the view that retrieval by similarity should prevail over retrieval by matching. They argue that it is important for a system to be able to evaluate the similarity between two different objects. Situations where the object to be retrieved is not the same as the object in the query, but something similar in the perceptual sense (for instance a query of the type: 'Find me all images with something/someone which looks like this one'), could then be dealt with more efficiently. Eventually the user can refine the query based on the results, by asking for more images 'like' one of the images retrieved, if that image happens to be what he/she is really seeking. User interaction and feedback can also be exploited so that the system learns how to best satisfy the user's preferences and interests [A101, A55]. Image features are most commonly organised into n-dimensional feature vectors. Thus the image features of a query and a stored item can be compared by evaluating the distance between the corresponding feature vectors in an n-dimensional feature space. The Euclidean distance is a commonly used and simple metric. Statistical distances such as the Mahalanobis distance have also been used [A44, A12, A77]. When different image properties are indexed separately, similarity or matching scores may be obtained for each property. The overall similarity criterion may then be obtained by linearly combining the individual scores [A59, A20, A7]. The weights for this linear combination may be user-specified, so the user can put more emphasis on one particular visual property.

Specific distances have been defined for histograms. A commonly used measure is the quadratic distance [A82]. It uses a similarity matrix which accounts for the perceptual difference between any two bins in the histogram. [A82] applies this distance to binary sets as well, and develops an interesting indexing scheme as a result. Another well-known histogram similarity measure is the histogram intersection introduced by Swain and Ballard [A89]. In [A88] and [A87], Stricker and Orengo present a theoretical analysis of the possibilities and limitations of histogram-based techniques. The abilities of the L1- and L2-norms to distinguish between two histograms are compared: the L1-norm is shown to have a higher discrimination power than the L2-norm. The sparseness of the histograms also influences the retrieval results. Other types of functions are explored by Gevers for his colour-metric pattern-cards [A25]. One of them is worth citing as it seems to have gained popularity: the Hausdorff distance. In [A71], the Hausdorff distance is used to compare the shapes of two objects. All these metrics can quantify the similarity between the properties of images in the database (this is useful when some image clustering is performed off-line for faster/easier retrieval on-line), as well as the similarity between a query and a stored image. According to [A60], the retrieval performance of a system will depend on the agreement between the functions or metrics used and human judgements of similarity. A similar view is taken by Santini and Jain [A74, A73], who aim at developing a set of similarity measures which would closely agree with human perception of similarity, on the basis of results from psychology experiments. They propose to extend the contrast model developed by Tversky [A94] to fuzzy sets. This contrast model implies that similarity ordering can be expressed as a linear combination of the common elements and the distinctive features of the two stimuli to be compared.
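
A small sketch of the quadratic-form histogram distance is given below; the construction of the bin-similarity matrix from inter-bin colour distances is one common, illustrative choice rather than the exact matrix used in [A82].

# Quadratic-form histogram distance: d(h1, h2) = sqrt((h1 - h2)^T A (h1 - h2)),
# where A[i, j] encodes the perceptual similarity of bins i and j.
import numpy as np

def similarity_matrix(bin_colours):
    """bin_colours: (n, 3) array of representative colours, one per histogram bin."""
    d = np.linalg.norm(bin_colours[:, None, :] - bin_colours[None, :, :], axis=-1)
    return 1.0 - d / d.max()          # 1 on the diagonal, smaller for dissimilar bins

def quadratic_distance(h1, h2, A):
    diff = h1 - h2
    # Guard against tiny negative values caused by numerical effects.
    return float(np.sqrt(max(diff @ A @ diff, 0.0)))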

8. Audio-based and Audio-assisted Semantic Content Analysis

On the basis that humans try to understand the semantic meaning of a multimedia document by deciphering clues from all the senses available, there is now an upsurge of interest in using multi-modal approaches to automated multimedia content analysis [E3][E18][E20]. The associated systems have been designed to extract and integrate information in a coherent manner from two or more of the signal sources, including audio, image frames, closed captions, text superimposed on images, and a variety of objects (e.g. humans, faces, animals), in the hope of revealing the underlying semantics of a media context. In this section we concern ourselves mainly with "audio-based" and "audio-assisted" multimedia (video) content analysis. By "audio" we refer to classes of signals such as speech, music and sound effects (shots, explosions, door slams, etc.) and their combinations. In the context of multimedia content understanding, the term "audio-based" means exclusively the use of information gathered from audio (acoustic signals) for scene/event segmentation and classification [E11][E14][E27][E28][E29], whereas the term "audio-assisted" refers to the practice of using audio-visual (AV) features in conjunction for effective and robust video scene analysis and classification. This second trend has attracted a flurry of research activity and interesting applications, see e.g. [E1][E5][E6][E13][E17]. An excellent recent review by Wang et al. can be found in [E26].

It has been recognised in recent years that, in addition to the visual (pictorial) component, the accompanying audio signal often plays an essential role in understanding video content. In some cases the results from audio analysis are more consistent and robust where particular genres and/or applications are concerned. Consistent with the objectives and applications of "visual-based" approaches to content understanding, the audio-based and audio-assisted approaches comprise similar lines of research activity.

In the following we briefly review some of these research findings and promising results.

8.1 Audio Feature Extraction

Audio feature extraction is concerned with extracting efficient and discriminatory acoustic features from the audio mode, at a 'short-term' frame level and a 'long-term' clip level, to summarise the stationary and temporal behaviour of the audio characteristics.

Feature extraction is a process critical to the success of a content-based video analysis system, and extensive studies have been carried out to address these issues, usually on a case-by-case basis. The MPEG-7 audio standard has assembled a collection of generic low-level tools and application-specific tools that provide a rich set of audio content descriptors [E12][E25]. An example of using some of these descriptors for generic sound recognition is described by Casey [E2]; it can be adapted to a wide range of applications. One application can be found in [E3], where 'male speech', 'female speech', and 'non-speech' segments are classified to help identify newscasters in news programmes, leading to accurate story unit segmentation and news topic categorisation. A good selection of acoustic features has been studied by Liu et al. [E10], including those derived from volume, zero-crossing rate, pitch, and frequency, for both short-term audio frames and long-term clips. These features were successfully used in a system for scene segmentation and genre classification [E8]. Also, in [E15] Roach et al. have used mel-frequency cepstrum coefficients (MFCC) and their first-order dynamic changes effectively for video genre classification. Boreczky and Wilcox also adopted MFCC in their work on video sequence segmentation [E1]. In [E22] features accounting for human cochlea models are introduced.
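
For illustration, frame-level features of the kind discussed here can be extracted with the librosa library as sketched below; the file name is a placeholder and the particular features and clip-level statistics are an illustrative selection rather than those of any cited system.

# Frame-level audio features (MFCCs, zero-crossing rate, an energy/volume measure)
# extracted with librosa; clip-level statistics summarise the frames.
import numpy as np
import librosa

y, sr = librosa.load('clip.wav', sr=22050)               # placeholder audio clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # short-term spectral envelope
zcr = librosa.feature.zero_crossing_rate(y)              # noisiness / voicing cue
rms = librosa.feature.rms(y=y)                           # volume contour

# 'Long-term' clip-level summary: mean and standard deviation of each frame feature.
clip_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                              [zcr.mean(), zcr.std(), rms.mean(), rms.std()]])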

8.2 Audio-based Content Analysis

As mentioned before, there are cases in a multimedia document where only the sound track is retained, or where the sound track is of primary interest, either carrying more consistent and robust semantic meanings or being computationally much simpler to handle. In such cases content analysis based on the audio mode alone is in a better position to perform the envisaged tasks. Examples include music genre classification [E24], special sound detection (shots, explosions, violence, etc.) in a video programme [E11], and many others, e.g. [E10][E22][E28].

Having extracted a number of audio features as previously discussed, Liu et al. [E10] first performed statistical analysis in the feature space, then employed a neural network or a hidden Markov model [E8] to learn the inherent structure of content belonging to different TV programme genres. The result is a system for genre classification and even scene break identification. Further on audio scene segmentation, Sundaram and Chang [E22] proposed an elaborate framework that takes account of dominant sound changes, the contributions of multiple features, listener models, etc. at different time scales, with a view to deriving consistent semantic segments. In [E29] audio recordings are segmented and classified into basic audio types such as silence, speech, music, song, environmental sound, speech with a music background, environmental sound with a music background, etc. Morphological and statistical analyses of the temporal curves of some basic features are performed to show differences among different types of audio. A heuristic rule-based procedure is then developed to segment and classify audio signals using these features.

8.3 Audio-assisted Content Analysis

The use of audio features in conjunction with visual features for video segmentation and classification is the subject of extensive study. In most application scenarios, this is a natural and sensible way forward [E6][E26] compared with single-mode approaches. Once a set of representational features for the audio and visual components has been extracted that potentially encapsulates semantic meanings, the syntactic structure (shots, scenes) and semantic concepts (story, genre, indoor, car chasing, etc.) of a video can be analysed. This can be performed using concatenated audio-visual feature vectors while employing appropriate probabilistic and temporal modelling techniques [E1][E8]. Alternatively, domain-specific heuristics can be used to generate a consistent outcome from separate audio and visual analysis results through hypothesis and verification. A variety of applications [E4] have been attempted, including video skimming [E19], highlight detection [E6][E16], scene segmentation and classification [E17][E23], and genre classification [E13][E15].

9. Content characterization of sports programs

The efficient distribution of sports videos over various networks should contribute to the rapid adoption and widespread usage of multimedia services, because sports videos appeal to large audiences. The valuable semantics in a sports video generally occupy only a small portion of the whole content, and the value of a sports video drops significantly after a relatively short period of time [H1]. The design of efficient automatic techniques suitable for semantically characterizing sports video documents is therefore necessary and very important.

Compared to other videos such as news and movies, sports videos have a well-defined content structure and domain rules. A long sports game is often divided into a few segments, and each segment in turn contains some sub-segments. For example, in American football, a game contains two halves, and each half has two quarters. Within each quarter there are many plays, and each play starts with the formation in which players line up on two sides of the ball. A tennis game is divided into sets, then games and serves. In addition, in sports video there are a fixed number of cameras in the field, which result in unique scenes during each segment. In tennis, when a serve starts, the scene is usually switched to the court view. In baseball, each pitch usually starts with a pitching view taken by the camera behind the pitcher. Furthermore, for TV broadcasting, there are commercials or other special information inserted between game sections [H2].

To face the problem of the semantic characterization of multimedia documents, a human being uses his/her cognitive skills, while an automatic system typically adopts a two-step procedure: in the first step, some low-level features are extracted in order to represent the low-level information in a compact way; in the second step, a decision-making algorithm is used to extract a semantic index from these low-level features.

To characterize multimedia documents, many different audio, visual, and textual features have been proposed and discussed in the literature [H3], [H4], [H5]. In particular, the problem of sports content characterization has received considerable attention.

For soccer video, for example, the focus was placed initially on shot classification [H6] and scene reconstruction [H7]. More recently, the problems of segmentation and structure analysis have been considered in [H8], [H9], whereas the automatic extraction of highlights and summaries has been analyzed in [H10], [H11], [H12], [H13], [H14], [H15], [H16]. In [H15], for example, a method that tries to detect the complete set of semantic events which may happen in a soccer game is presented. This method uses the position information of the players and of the ball during the game as input, and therefore needs a quite complex and accurate tracking system to obtain this information.

As far as baseball sequences are concerned, the problem of indexing for video retrieval has been considered in [H17], whereas the extraction of highlights is addressed in [H18], [H19], [H2].

The indexing of Formula 1 car races is considered in [H20], [H21], and the proposed approach uses audio, video and textual information.

The analysis of tennis videos can be found, for example, in [H2], [H22], whereas basketball and football are considered in [H23], [H24], [H25], and [H26] respectively, to give a few examples.

In this section we analyze some techniques proposed in the literature for the content characterization of sports videos. The analysis focuses on the typology of the signal (audio, video, text, multi-modal, ...) from which the low-level features are extracted.

9.1 General considerations on the characterization of sports videos

The analysis of the methods proposed in the literature for the content characterization of sports documents could be addressed in various ways. A possible classification could be based, for example, on the type of sport considered, e.g., soccer, baseball, tennis, basketball, etc. Another possibility could be to consider the methodology used by the characterization algorithm, e.g., a deterministic versus a statistical approach, to give two possible examples.

In this section we have analyzed the various techniques from the point of view of the typology of the signal (audio, video, text, ...) from which the low-level features involved in the process of document characterization are extracted.

Considering the audio signal, the related features are usually extracted at two levels: a short-term frame level and a long-term clip level [H4].

The frame-level features are usually designed to capture the short-term characteristics of the audio signal, and the most widely used have been:

1) Volume ("loudness" of the audio signal); 2) Zero Crossing Rate, ZCR (number of times that the audio waveform crosses the zero

axis); 3) Pitch (fundamental frequency of an audio waveform); 4) Spectral features (parameters that describes in a compact way the spectrum of an audio

frame).
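
A minimal sketch of how these four frame-level features can be computed for a single frame is given below (pitch is estimated with a crude autocorrelation peak search, and the spectral centroid stands in for the spectral features); the function name, the pitch search band and the choice of the centroid are assumptions made here for illustration.

    import numpy as np

    def frame_features(frame, sr):
        """Compute the four frame-level features for one audio frame (mono, in [-1, 1])."""
        volume = np.sqrt(np.mean(frame ** 2))                  # RMS "loudness"
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)     # zero-crossing rate

        # Crude pitch estimate: highest autocorrelation peak in a 60-400 Hz lag range.
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / 400), int(sr / 60)
        pitch = sr / (lo + np.argmax(ac[lo:hi])) if hi < len(ac) else 0.0

        # One compact spectral feature: the spectral centroid of the frame.
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        return volume, zcr, pitch, centroid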

To extract the semantic content, we need to observe the temporal variation of the frame features on a longer time scale. This consideration has led to the development of various clip-level features, which characterize how the frame-level features change over a clip [H4].

These clip-level features are based on the frame-level features, and the most widely used have been:

1) Volume based, mainly used to capture the temporal variation of the volume in a clip;
2) ZCR based, usually derived from statistics of the ZCR;
3) Pitch based;
4) Frequency based, reflecting the frequency distribution of the energy of the signal.
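
Building on the frame-level sketch above, the clip-level statistics mentioned here can be computed, for instance, as follows; the particular statistics chosen (standard deviations, low-energy ratio, mean centroid) are illustrative examples rather than a prescribed set.

    import numpy as np

    def clip_features(frame_feats):
        """Summarise a clip from its per-frame (volume, zcr, pitch, centroid) rows."""
        volume, zcr, pitch, centroid = np.asarray(frame_feats).T
        return {
            # Volume based: temporal variation of the loudness within the clip.
            "volume_std": np.std(volume),
            "low_energy_ratio": np.mean(volume < 0.5 * np.mean(volume)),
            # ZCR based: simple statistics of the zero-crossing rate.
            "zcr_mean": np.mean(zcr), "zcr_std": np.std(zcr),
            # Pitch based: how stable the fundamental frequency is over the clip.
            "pitch_std": np.std(pitch),
            # Frequency based: average spectral centroid over the clip.
            "centroid_mean": np.mean(centroid),
        }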

Related to the audio signal, there are also techniques which try to detect and interpret specific keywords pronounced by the speaker commenting on the sports video. This type of information is usually very useful, even if it is very difficult to obtain.

Considering the visual signal, the related features can be categorized into four groups, namely: color, texture, shape, and motion.

1) Color: Color is an important attribute for image representation, and the color histogram, which represents the color distribution in an image, is one of the most widely used color features.

2) Texture: Texture is also an important feature of a visible surface where repetition or quasi-repetition of a fundamental pattern occurs.

3) Shape: Shape features, which are related to the shape of the objects in the image, are usually represented using traditional shape analysis tools such as moment invariants, Fourier descriptors, etc.

4) Motion: Motion is an important attribute of video. Motion features, such as moments of the motion field, motion histograms, or global motion parameters, have been widely used.
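
As a small illustration of the first and last categories, the sketch below computes a global colour histogram from an RGB frame and an orientation histogram from a dense motion field; the inputs (an H x W x 3 uint8 image and a precomputed H x W x 2 flow field) and the bin counts are assumptions made for the example.

    import numpy as np

    def colour_histogram(frame_rgb, bins=8):
        """3-D RGB histogram of one frame (H x W x 3, uint8), L1-normalised."""
        pixels = frame_rgb.reshape(-1, 3)
        hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=((0, 256),) * 3)
        return hist.ravel() / pixels.shape[0]

    def motion_histogram(flow, bins=8):
        """Histogram of motion-vector orientations from a dense flow field (H x W x 2)."""
        angles = np.arctan2(flow[..., 1], flow[..., 0])          # orientation in [-pi, pi]
        hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
        return hist / max(hist.sum(), 1)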

Another important aspect of the analysis of the video signal is the basic segment used to extract the features, which can be composed of one or a few images, or of an entire video shot.

Related to the image and video analysis, there are also techniques in which the textual captions and logos superimposed on the images are detected and interpreted. These captions usually carry significant semantic information that can be very useful if available [H27], [H28].

In the next subsections we will describe some techniques based on visual information, then some methods that analyze audio information, and finally the techniques which consider both audio and visual information, in a multi-modal fashion.

9.2 Techniques based on visual information

In this subsection we describe some techniques for the content characterization of sports videos that use features extracted mainly from the image and video signal. For a more complete description of the features proposed for content analysis based on image and video, please refer to [H4], [H5].

Baseball and tennis video analysis

Di Zhong and Shih-Fu Chang at ICME'2001 [H2] proposed a method for the temporal structure analysis of live broadcast sports videos, using tennis and baseball sequences as examples. As noted above, such videos have a well-defined content structure and domain rules: a tennis game is divided into sets, then games and serves, and when a serve starts the scene is usually switched to the court view, while in baseball each pitch usually starts with a pitching view taken by the camera behind the pitcher. The main objective of the work presented in [H2] is the automatic detection of fundamental views (e.g., serve and pitch) that indicate the boundaries of higher-level structures. Given the detection results, useful applications such as tables of contents and structure summaries can be developed. In particular, in [H2] the recurrent event boundaries, such as pitching and serving views, are identified by using supervised learning and domain-specific rules. The proposed technique for detecting basic units within a game, such as serves in tennis and pitches in baseball, uses the idea that these units usually start with a special scene. A mainly color-based approach is used, and to achieve higher performance an object-level verification step to remove false alarms was introduced. In particular, spatial consistency constraints (color and edge) are considered to segment each frame into regions. Such regions are merged based on proximity and motion. Merged regions are classified into foreground moving objects or background objects based on rules concerning the motion near region boundaries and long-term temporal consistency. One unique characteristic of serve scenes in a tennis game is the presence of horizontal and vertical court lines. The detection of these lines is taken into account to improve the performance of the identification algorithm.

The analysis of tennis video was also carried out in 2001 by Petkovic et al. [H22]. They propose a method for the automatic recognition of strokes in tennis videos based on hidden Markov models. The first step is to segment the player from the background; HMMs are then trained to perform the recognition task. The considered features are the dominant color and a shape description of the segmented player, and the method appears to lead to satisfactory performance.

The problem of highlight extraction in baseball game videos has been further considered at ICIP'2002 by P. Chang et al. [H19]. In particular, a statistical model is built up in order to explore the specific spatial and temporal structure of highlights in broadcast baseball game videos. The proposed approach is based on two observations. The first is that most baseball highlights are composed of certain types of scene shots, which can be divided into a limited number of categories. The authors identified seven important types of scene shots with which most interesting highlights can be composed. These types of shots are defined as: 1) pitch view, 2) catch overview, 3) catch close-up, 4) running overview, 5) running close-up, 6) audience view and 7) touch-base close-up. Although the exact video streams of the same type of scene shot differ from game to game, they strongly exhibit common statistical properties of certain measurements, due to the fact that they are likely to be taken by broadcasting cameras mounted at similar locations, covering similar portions of the field, and used by the cameraman for similar purposes, for example to capture an overview of the outer field or to track a running player. As previously mentioned, most highlights are composed of certain types of shots, and the second observation is that the context of the transitions between those scene shots usually implies the classification of the highlight. In other words, the same type of highlight usually has a similar transition pattern of scene shots. For example, a typical home run can be composed of a pitch view followed by an audience view and then a running close-up view. The features used in [H19] are an edge descriptor, grass amount, sand amount, camera motion and player height. Of course the context of home runs can vary, but it can be adequately modelled using a hidden Markov model (HMM). In the proposed system an HMM for each type of highlight is learned. A probabilistic classification is then made by combining the view classification and the HMM. In summary, the proposed system first segments a digitized game video into scene shots. Each scene shot is then compared with the learnt model, and its associated probability is calculated. Finally, given the stream of view classification probabilities, the probability of each type of highlight can be computed by matching the stream of view classification probabilities against the trained HMMs. In particular, the following highlights have been considered: "home run", "catch", "hit" and "infield play", and the simulation results appear quite satisfactory.
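
The transition-pattern idea can be illustrated with a deliberately simplified stand-in for the HMM classifier of [H19]: one first-order Markov chain per highlight type over the seven view labels, scored by maximum likelihood on an observed shot-label sequence. The transition tables and label encoding below are made-up placeholders.

    import numpy as np

    VIEWS = ["pitch", "catch_far", "catch_close", "run_far",
             "run_close", "audience", "touch_base"]

    def sequence_log_likelihood(labels, init, trans):
        """Log-likelihood of a shot-label sequence under one Markov-chain model."""
        ll = np.log(init[labels[0]])
        for prev, cur in zip(labels[:-1], labels[1:]):
            ll += np.log(trans[prev, cur])
        return ll

    def classify_highlight(labels, models):
        """Pick the highlight model (e.g. 'home_run') with the highest likelihood."""
        scores = {name: sequence_log_likelihood(labels, init, trans)
                  for name, (init, trans) in models.items()}
        return max(scores, key=scores.get)

    # Toy models with made-up probabilities; index order follows VIEWS above.
    uniform = np.full((7, 7), 1.0 / 7)
    home_run = uniform.copy()
    home_run[0, 5] = 0.6                      # pitch view often followed by audience view
    home_run[0] /= home_run[0].sum()
    models = {"home_run": (np.full(7, 1.0 / 7), home_run),
              "other": (np.full(7, 1.0 / 7), uniform)}

    print(classify_highlight([0, 5, 4], models))   # pitch -> audience -> running close-up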

Soccer video analysis

Particular attention has been devoted in the literature to the problem of content characterization of soccer video. In soccer video, for example, each play typically contains multiple shots with similar color characteristics, so a simple clustering of shots would not reveal the high-level play transition relations. Moreover, soccer video does not have canonical views (such as the pitching view in baseball) indicating the event boundaries. Due to these considerations, specific techniques have been developed for the analysis of this type of sports video sequence.

At ICME'2001, S.-F. Chang et al. [H8] proposed an algorithm for the structure analysis and segmentation of soccer video. Some works on sports video analysis and video segmentation use the shot as the basic unit of analysis. However, such an approach is often ineffective for sports video, due to errors in shot detection and the lack of, or mismatch with, a domain-specific temporal structure. Starting from this consideration, in [H8], instead of using the shot-based framework, a different approach is proposed, where frame-based domain-specific features are classified into mid-level labels through unsupervised learning, and temporal segmentation of the label sequences is used to automatically detect the high-level structure. Moreover, fusion among multiple label sequences based on different features is used to achieve higher performance. In particular, the high-level structure of the content is revealed using information on whether the ball is in play or not. The first step is to classify each sample frame into three kinds of view (the mid-level labels: global, zoom-in, and close-up) using a unique domain-specific feature, the grass-area ratio. Heuristic rules are then applied to the view-label sequence, obtaining the play/break status of the game.
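
A minimal sketch of the grass-area-ratio idea follows: each frame is mapped to one of the three mid-level view labels by thresholding the fraction of grass-coloured pixels, and a toy heuristic then turns the label sequence into a play/break status. The green range, the thresholds and the break rule are illustrative assumptions, not the values used in [H8].

    import numpy as np

    def grass_ratio(frame_rgb):
        """Fraction of pixels falling in a rough 'grass green' range (illustrative)."""
        r = frame_rgb[..., 0].astype(int)
        g = frame_rgb[..., 1].astype(int)
        b = frame_rgb[..., 2].astype(int)
        return np.mean((g > 80) & (g > r + 20) & (g > b + 20))

    def view_label(ratio, hi=0.5, lo=0.2):
        """Map a grass-area ratio to one of the three mid-level view labels."""
        return "global" if ratio > hi else ("zoom-in" if ratio > lo else "close-up")

    def play_break(labels, min_break=3):
        """Toy heuristic: runs of at least `min_break` non-global frames count as 'break'."""
        status, run = [], 0
        for lab in labels:
            run = run + 1 if lab != "global" else 0
            status.append("break" if run >= min_break else "play")
        return status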

The previously described work has been further refined in [H9], where an algorithm for parsing the structure of produced soccer programs is proposed. First, two mutually exclusive states of the game are defined: play and break. A domain-tuned feature set, dominant color ratio and motion intensity, is selected based on the special syntax and content characteristics of soccer videos. Each state of the game has a stochastic nature that is modelled with a set of hidden Markov models. Finally, standard dynamic programming techniques are used to obtain the maximum likelihood segmentation of the game into the two states.

Ekin and Tekalp [H16] at SPIE'2003 proposed a framework for the analysis and summarization of soccer videos using cinematic and object-based features. The proposed framework includes some novel low-level soccer video processing algorithms, such as dominant color region detection, robust shot boundary detection, and shot classification, as well as some higher-level algorithms for goal detection, referee detection and penalty-box detection. The system can output three types of summaries: 1) all slow-motion segments in a game; 2) all goals in a game; 3) slow-motion segments classified according to object features. The first two types of summaries are based on cinematic features only, for computational efficiency, while the summaries of the last type contain higher-level semantics. In particular, the authors propose new dominant color region and shot boundary detection algorithms that are robust to variations in the dominant color, to take into account the fact that the color of the grass field may vary from stadium to stadium, and also as a function of the time of day in the same stadium. Moreover, the algorithm proposed for goal detection is based solely on cinematic features resulting from the common rules employed by the producers after goal events to provide a better visual experience for TV audiences. The distinguishing jersey color of the referee is used for referee detection. Penalty-box detection is based on the three-parallel-line rule that uniquely specifies the penalty-box area in a soccer field. Considering, for example, the algorithm for goal detection, the authors define a cinematic template that should satisfy the following requirements. Duration of the break: a break due to a goal lasts no less than 30 and no more than 120 seconds. The occurrence of at least one close-up/out-of-field shot: this shot may either be a close-up of a player or an out-of-field view of the audience. The existence of at least one slow-motion replay shot: the goal play is most often replayed one or more times. The relative position of the replay shot: the replay shot follows the close-up/out-of-field shot.
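
The goal template can be expressed as a simple check over the shots of a candidate break, for instance as in the sketch below; the shot representation (type, duration, slow-motion flag) and the shot-type names are assumptions introduced here, while the four conditions follow the requirements listed above.

    # Each shot is a dict like {"type": "close-up", "duration": 8.0, "slow_motion": False};
    # the "type" values assumed here are "long", "close-up", "out-of-field" and "replay".

    def is_goal_candidate(break_shots):
        """Check the four cinematic requirements of the goal template on one break (sketch)."""
        total = sum(s["duration"] for s in break_shots)
        if not 30.0 <= total <= 120.0:                  # the break lasts 30-120 seconds
            return False
        closeups = [i for i, s in enumerate(break_shots)
                    if s["type"] in ("close-up", "out-of-field")]
        replays = [i for i, s in enumerate(break_shots) if s["slow_motion"]]
        if not closeups or not replays:                 # need at least one of each
            return False
        return min(replays) > min(closeups)             # the replay follows the close-up

    shots = [{"type": "close-up", "duration": 6.0, "slow_motion": False},
             {"type": "out-of-field", "duration": 10.0, "slow_motion": False},
             {"type": "replay", "duration": 20.0, "slow_motion": True}]
    print(is_goal_candidate(shots))                     # True: all four conditions hold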

The problem of highlight extraction in soccer video has also been considered by Leonardi, Migliorati et al. [H10], [H11], [H12], [H13]. In [H10] and [H11] the correlation between low-level descriptors and the semantic events in a soccer game has been studied. In particular, in [H10] it is shown that low-level descriptors are not sufficient, individually, to obtain satisfactory results (i.e., all semantic events detected with only a few false detections). In [H11] and [H13] the authors have tried to exploit the temporal evolution of the low-level descriptors in correspondence with semantic events, by proposing an algorithm based on a finite-state machine. This algorithm gives good results in terms of accuracy in the detection of the relevant events, whereas the number of false detections remains quite large. The considered low-level motion descriptors, associated with each P-frame, represent the following characteristics: lack of motion, camera operations (pan and zoom parameters), and the presence of shot-cuts. The descriptor "lack of motion" has been evaluated by thresholding the mean magnitude of the motion vectors. The camera motion parameters, represented by horizontal "pan" and "zoom" factors, have been evaluated using a least-mean-squares method applied to the P-frame motion fields. Shot-cuts have been detected through sharp transitions of the motion information and a high number of intra-coded macroblocks in P-frames. The above-mentioned low-level indices are not sufficient, individually, to reach satisfactory results. To find particular events, such as goals or shots toward goal, it is suggested to exploit the temporal evolution of the motion indices in correspondence with such events. Indeed, in correspondence with goals a fast pan or zoom often occurs, followed by a lack of motion, followed by a nearby shot-cut. The concatenation of these low-level events is adequately modelled with a finite-state machine. The performance of the proposed algorithm has been tested on 2 hours of MPEG-2 sequences. Almost all live goals are detected, and the algorithm is able to detect some shots toward goal too, while it gives poor results on free kicks. The number of false detections remains high.
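
The temporal pattern described above (fast pan or zoom, then lack of motion, then a nearby shot-cut) lends itself to a very small finite-state machine, sketched below; the per-P-frame descriptor dictionary and the timeout window are assumptions made for the example, not details of [H11], [H13].

    def detect_goal_like_events(descriptors, max_gap=25):
        """Scan per-P-frame binary descriptors with a three-stage finite-state machine.

        `descriptors` is a list of dicts such as
        {"fast_pan_or_zoom": 1, "lack_of_motion": 0, "shot_cut": 0};
        `max_gap` bounds the number of P-frames between consecutive stages (assumed value).
        """
        events, state, since = [], "idle", 0
        for i, d in enumerate(descriptors):
            since += 1
            if state != "idle" and since > max_gap:
                state = "idle"                           # pattern timed out, start over
            if state == "idle" and d["fast_pan_or_zoom"]:
                state, since = "pan_seen", 0
            elif state == "pan_seen" and d["lack_of_motion"]:
                state, since = "still_seen", 0
            elif state == "still_seen" and d["shot_cut"]:
                events.append(i)                         # candidate goal / shot toward goal
                state, since = "idle", 0
        return events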

Replay segment identification

In the analysis of sports videos it is sometimes important to identify the presence of slow-motion replays.

At ICASSP'2001, Sezan et al. [H28] presented and discussed an algorithm for the detection of slow-motion replay segments in sports video. Specifically, the proposed method localizes semantically important events by detecting slow-motion replays of these events, and then generates highlights of these events at different levels. In particular, a hidden Markov model is used to model the slow-motion replay, and an inference algorithm is introduced which computes the probability of a slow-motion replay segment and localizes the segment boundaries. Four features are used in the HMM, three of which are calculated from the pixel-wise mean square difference of the intensity of every two subsequent fields, and one of which is computed from the RGB color histogram of each field. The first three features describe still, normal-motion and slow-motion fields. The fourth feature is for capturing the gradual transitions in editing effects.

An evolution of the previously described work was presented at ICASSP'2002 [H29], where an automatic algorithm for replay segment detection is proposed that detects frames containing the logos shown in the special scene transitions that sandwich replays. The proposed algorithm first automatically determines the logo template from frames surrounding slow-motion segments; then, it locates all the similar frames in the video using the logo template. Finally, the algorithm identifies the replay segments by grouping the detected logo frames and slow-motion segments.

9.3 Techniques based on the audio signal

While current approaches for audiovisual data segmentation and classification are mostly focused on visual cues, the audio signal may actually play an important role in content parsing for many applications. In this section we describe some techniques for the content characterization of sports videos that use features extracted mainly from the audio signal associated with the multimedia document. For a more complete description of the techniques proposed for content analysis based on the audio signal, please refer to [H3].

The first example that we consider is related to the characterization of baseball videos, and was proposed at ACM Multimedia 2000 by Rui et al. [H18]. In this work the detection of highlights in baseball programs is carried out considering audio-track features alone, without relying on expensive-to-compute video-track features. The characterization is performed considering a combination of generic sports features and baseball-specific features, combined using a probabilistic framework. In this way, highlight detection can even be done on the local set-top box using limited computing power. The audio track consists of the presenter's speech, mixed with crowd noise, remote traffic and music noises, with automatic gain control changing the audio level. To give an idea of the features taken into account, bat-and-ball impact detection is used to adjust the likelihood of a highlight segment, and therefore the same technology could in principle also be used for other sports, such as golf. In particular, the audio features considered have been: energy-related features, phoneme-level features, information complexity features, and prosodic features. These features are used for solving different problems. Specifically, some of them are used for human speech endpoint detection, while others are used to build a temporal template to detect baseball hits or to model excited human speech. These features have been suitably combined within a probabilistic framework. The performance of the proposed algorithm is evaluated by comparing its output against human-selected highlights for a diverse collection of baseball games, and the results appear very encouraging.

To give another example, we consider the segmentation into three classes of the audio signal associated with a football audio-video sequence, proposed at IWDSC'2002 by Lefevre et al. [H26]. The audio data is divided into short sequences (typically with a duration of half a second to one second), which are classified into several classes (speaker, crowd, referee whistle). Every sequence can then be further analyzed depending on the class it belongs to. Specifically, the proposed method uses cepstral analysis and hidden Markov models. The results presented in terms of accuracy of the three-class segmentation are good.

9.4 Techniques based on multi-modal analysis

In the previous subsections we have considered some approaches based on the analysis of the audio signal or of the image and video signal alone. In this section we will consider some examples of algorithms that use both audio and visual cues, in order to exploit the full potential of the multimedia information. For a more complete description of the features proposed for content analysis based on both audio and visual signals, please refer to [H4], [H5].

Baseball audio-visual analysis

Gong et al. [H30] at ACM Multimedia 2002 proposed an integrated baseball digest system. The system is able to detect and classify highlights from baseball game videos in TV broadcasts. The digest system gives complete indices of a baseball game, covering all status changes in a game. The result is obtained by combining image, audio and speech cues using a maximum entropy method. The image features considered are the color distribution, the edge distribution, the camera motion, player detection, and the shot length. Considering the color distribution, the authors observe that every sport has a typical scene, such as the pitch scene in baseball, the corner-kick scene in soccer, or the serve scene in tennis, and that the color distributions of individual game images are highly correlated for similar scenes. Given the layout of grass and sand in a scene shot of a baseball video, it is easy to detect where the camera is shooting from. Considering the edge distribution, this feature is useful to distinguish audience scenes from field scenes; the edge density is always higher in audience scenes, and this information is used as an indicator of the type of the current scene. Another feature that is considered is the camera motion, estimated using a robust algorithm. Player detection is also considered as a visual feature: the players are detected considering color, edge and texture information, and the maximum player size and the number of players are the features associated with each scene shot. The authors also consider some audio features. In particular, the presence of silence, speech, music, hail, or a mixture of music and speech in each scene shot is detected. To perform this task they use Mel-cepstral coefficients as features, modelled using Gaussian mixture models. Considering that closed captions provide hints about the presence of highlights, the authors suggest extracting informative words or phrases from the closed captions. From the training data, a list of 72 informative words is chosen, such as field, center, strike, etc. The multimedia features are then fused using an algorithm based on the maximum entropy method to perform the highlight detection and classification.
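
The fusion step can be pictured with a maximum-entropy classifier over the concatenated multimedia cues; the sketch below uses scikit-learn's multinomial logistic regression (an equivalent formulation of maximum entropy) with random placeholder features, so the feature layout and class labels are purely illustrative and not those of [H30].

    import numpy as np
    from sklearn.linear_model import LogisticRegression   # multinomial logistic regression,
                                                           # i.e. a maximum-entropy classifier

    # Hypothetical per-shot fused features: visual statistics, audio class probabilities
    # and a closed-caption keyword indicator, concatenated into one vector per scene shot.
    rng = np.random.default_rng(0)
    X = rng.random((300, 12))
    y = rng.integers(0, 3, 300)          # 0 = no highlight, 1 = home run, 2 = other highlight

    maxent = LogisticRegression(max_iter=1000)
    maxent.fit(X, y)                     # training on labelled scene shots
    print(maxent.predict_proba(X[:1]))   # posterior over highlight classes for a new shot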

Formula 1 car races audio-visual analysis

Petkovic et al. [H20] at ICME'2002 proposed an algorithm for the extraction of highlights from TV Formula 1 programs. The extraction is carried out considering a multi-modal approach that uses audio, video and superimposed text annotations, combined by means of dynamic Bayesian networks (DBNs). In particular, four audio features are selected for speech endpoint detection and the extraction of excited speech, namely: short-time energy (STE), pitch, Mel-frequency cepstral coefficients (MFCC) and pause rate. For the recognition of specific keywords in the announcer's speech a keyword-spotting tool based on a finite-state grammar has been used. In the visual analysis, color, shape and motion features are considered. First the video is segmented into shots based on the differences of the color histograms among several consecutive frames. Then the amount of motion is estimated, and semaphore, dust, sand and replay detectors are applied in order to characterize passing, start and fly-out events. The third information source used in the processing is the text that is superimposed on the screen. This is another type of on-line annotation done by the TV programme producer, which is intended to help the viewer better understand the video content. The superimposed text often brings some additional information that is difficult or even impossible to deduce solely by looking at the video signal.

The details of this algorithm are described in [H21]. The reported results show that the fusion of cues from the different media results in a much better characterization of Formula 1 races. The audio DBN was able to detect a large number of segments where the announcer raised his voice, which corresponds to only 50% of all interesting segments, i.e., highlights in the race. The integrated audio-visual DBN was able to correct this result and detect about 80% of all interesting segments in the race.

Soccer audio-visual analysis

Leonardi et al. [H14] at WIAMIS'2003 presented a semantic soccer-video indexing algorithm that uses controlled Markov chains [H31] to model the temporal evolution of low-level video descriptors [H12]. To reduce the number of false detections given by the proposed video-processing algorithm, they add audio signal characteristics. In particular, they evaluate the "loudness" associated with each video segment identified by the analysis carried out on the video signal. The intensity of the "loudness" is then used to order the selected video segments. In this way, the segments associated with the interesting events appear in the very first positions of the ordered list, and the number of false detections can be greatly reduced. The low-level binary descriptors, associated with each P-frame, represent the following characteristics: lack of motion, camera operations (pan and zoom parameters), and the presence of shot-cuts, and are the same descriptors used in [H11]. Each descriptor takes values in the set {0, 1}. The components of a controlled Markov chain model are the state and input variables, the initial state probability distribution, and the controlled transition probability function. We suppose that the occurrence of a shot-cut event causes the system to change dynamics. In order to model this fact, we describe the state of the system as a two-component state, and we also impose a certain structure on the controlled transition probability function. We suppose that each semantic event takes place over a two-shot block and that it can be modeled by a controlled Markov chain with the structure described above. Each semantic event is then characterized by two sets of probability distributions over the state space. Specifically, we have considered six models, denoted by A, B, C, D, E, and F, where model A is associated with goals, model B with corner kicks, and models C, D, E, F describe other situations of interest that occur in soccer games, such as free kicks, plain actions, and so on. On the basis of the six derived controlled Markov models, one can classify each pair of shots in a soccer game video sequence by using the maximum likelihood criterion. For each pair of consecutive shots (i.e., two consecutive sets of P-frames separated by shot-cuts), one needs to:

i) extract the sequence of low-level descriptors,
ii) determine the sequence of values assumed by the state variable, and
iii) determine the likelihood of the sequence of values assumed by the low-level descriptors according to each one of the six admissible models.

The model that maximizes the likelihood function is then associated with the considered pair of shots.
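
This maximum-likelihood classification step can be sketched as follows, with the six models represented simply by their initial and transition probability tables; the state encoding and the randomly generated tables are placeholders standing in for the controlled Markov chain models of [H14].

    import numpy as np

    def log_likelihood(states, init, trans):
        """Log-likelihood of an observed state sequence under one model (sketch)."""
        ll = np.log(init[states[0]])
        for prev, cur in zip(states[:-1], states[1:]):
            ll += np.log(trans[prev, cur])
        return ll

    def classify_shot_pair(states, models):
        """Assign the pair of shots to the model (A..F) maximising the likelihood."""
        return max(models, key=lambda name: log_likelihood(states, *models[name]))

    # Placeholder models over an 8-value state space (the 3 binary descriptors combined).
    rng = np.random.default_rng(1)
    models = {}
    for name in "ABCDEF":
        init = rng.random(8); init /= init.sum()
        trans = rng.random((8, 8)); trans /= trans.sum(axis=1, keepdims=True)
        models[name] = (init, trans)

    print(classify_shot_pair([0, 3, 3, 7, 1], models))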

The performance of the proposed algorithm has been tested on about 2 hours of MPEG-2 sequences containing more than 800 shot-cuts, and the results are very promising, although the number of false detections is still quite significant. As these results are obtained using motion information only, it was decided to reduce the false detections by associating the audio loudness with the candidate pairs of shots.

To extract the relevant features, we have divided the audio stream of a soccer game sequence into consecutive clips of 1.5 seconds, in order to observe a quasi-stationary audio signal within this window [H4]. For each frame the "loudness" is estimated as the energy of the sequence of audio samples associated with the current audio frame. The evolution of the "loudness" in an audio clip follows the variation in time of the amplitude of the signal, and it therefore constitutes a fundamental aspect for audio signal classification. We estimate the mean value of the loudness for every clip. In this way we obtain, for each clip, a low-level audio descriptor represented by the "clip loudness". The false detections given by the candidate pairs of shots obtained by video processing are reduced by ordering them according to the average value of the "clip loudness" over the time span of the considered segment. In this way, the video segments containing the goals appear in the very first positions of this ordered list. The simulation results appear to be very encouraging, reducing the number of false detections by an order of magnitude.
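
A minimal sketch of this loudness-based re-ranking is given below: the per-clip loudness is estimated as the mean energy of consecutive 1.5-second clips, and the candidate segments are then ordered by their average clip loudness. The array shapes and the (start, end) segment representation are assumptions made for the example.

    import numpy as np

    def clip_loudness(samples, sr, clip_sec=1.5):
        """Mean energy of consecutive clips of `clip_sec` seconds (mono samples in [-1, 1])."""
        clip_len = int(sr * clip_sec)
        n_clips = len(samples) // clip_len
        clips = samples[:n_clips * clip_len].reshape(n_clips, clip_len)
        return np.mean(clips ** 2, axis=1)

    def rank_candidates(candidates, loudness, clip_sec=1.5):
        """Order candidate segments, given as (start, end) in seconds, by average clip loudness."""
        def avg_loudness(seg):
            i0 = int(seg[0] / clip_sec)
            i1 = max(int(seg[1] / clip_sec), i0 + 1)
            return float(np.mean(loudness[i0:i1]))
        return sorted(candidates, key=avg_loudness, reverse=True)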

9.5 Discussion on the considered algorithms

The analysis of the techniques for sports video characterization suggests an important consideration about the way in which the interesting events are captured by the algorithms. We can clearly see that the characterization is carried out either by considering what happens in a specific time segment, observing therefore the features in a "static" way, or by trying to capture the "dynamic" evolution of the features in the time domain.

In tennis, for example, if we are looking for a serve scene, we can take into account the specific, well-known situation in which the player is clearly visible in the scene, and so we can try to recognize the player's shape, as suggested in [H22]. In the same way, if we are interested in the detection of the referee in a soccer game video, we can look for the color of the referee's jersey, as suggested in [H16]. Also the detection of a penalty box can be obtained by the analysis of the parallel field lines, which is clearly a "static" evaluation. Considering, for example, Formula 1 car races, the start and fly-out events are characterized in [H20] by calculating the amount of motion and detecting the presence of the semaphore, dust, and sand. In all these examples, it is clear that it is a "static" characteristic of the highlights of interest that is captured by the automatic algorithm. A similar situation occurs in baseball if we are interested, for example, in pitch event detection, and we consider that each pitch usually starts with a pitching view taken by the camera behind the pitcher.

On the other hand, if we want, for example, to detect the goals in a soccer game, we can try to capture the "dynamic" evolution of some low-level features, as suggested in [H11], [H13], or try to recognize some specific cinematic patterns, as proposed in [H16]. In the same way, we are looking at a "dynamic" characteristic if we want to automatically determine the slow-motion replay segments, as suggested in [H29].

The effectiveness of each approach depends mainly on the kind of sport considered and on the type of highlights we are interested in.

10. Content-Based Indexing and Retrieval Systems

Many content-based indexing and retrieval systems have been developed over the last decade. However, no system or technology has yet become widely pervasive. Most of these systems are currently available for general and domain-specific use. A review of the most relevant systems is given in this section, focusing on the varying functionalities that have been implemented in each system.

Looking at the theoretical side, a number of visual systems have been proposed for the retrieval of multimedia data. These systems fall broadly under four categories: query by content [C70, C72, C73, C74, C75], iconic query [C69, C71], SQL query [C77, C78], and mixed queries [C68, C71, C76, C79]. Query by content is based on images, tabular forms, similarity retrieval (rough sketches) or component features (shape, color, texture). The iconic query represents data with 'look-alike' icons and specifies a query by the selection of icons. SQL queries are based on keywords, with the keywords being conjoined with relationships (AND, OR) between them, thus forming compound strings. Mixed queries can be specified by text as well as by icons. All of these systems are based on different indexing structures.

In his excellent review of current content-based recognition and retrieval systems, Paul Hill [C80] describes the most relevant systems, in terms of both commercial and academic availability, which can be classified according to the database population, query techniques and indexing features used. Here, database population refers to the actual process of populating a database. The systems reviewed in this section are: QBIC, Photobook, Netra & Netra-V, Virage, Webseek, Islip/Infomedia and Fast Multiresolution Image Query. Additional systems which are considered to be less important but may have interesting and novel query interfaces are: ViBE, Multi-linearization data structure for image browsing, A user interface framework for image searching, and Interfaces for emergent semantics in multimedia databases. Although this list is by no means exhaustive, it covers the key players in the area.

IBM's Query by Image Content (QBIC) has been cited as a primary reference by most of the reviewed systems. It has gained this position from being one of the longest-available commercial systems and probably the most pervasive. The system is available in several different versions, ranging from an evaluation version with limited functionality to a full version described in [C81, C82].

Netra [C83] is an image retrieval system developed at UC Santa Barbara. It is an entirely region based system that uses colour, texture, shape and spatial location information for indexing and discrimination. This is achieved by initially segmenting the images in the database population stage using a robust image segmentation technique developed by the authors known as the edge flow technique. Each region is then indexed using features extracted from each region.

Netra-V [C84] extends the Netra system to video, where the regions are obtained not just by spatial segmentation but by spatio-temporal segmentation. Image segmentation of input images is based on an edge flow technique previously developed by the authors [C85]. It is claimed that accurate segmentation is achieved with a minimum number of parameter inputs.

A related system in terms of query functionality is the Chabot [C86] system developed at UC Berkeley. Within the Netra system, a texture keyword can be used to pose a query. Similarly, within the Chabot system a concept can be used (and defined) that combines other forms of query and is associated with a referenced term such as 'flower garden', etc. Artisan [C92], developed at the University of Northumbria, is a purely shape-based retrieval system that attempts to classify shapes into "Boundary Families", defined in terms of perceptually guided attributes such as collinearity, proximity and pattern repetition.

The Fast Multiresolution Image Query system was produced at the University of Washington specifically to enable efficient similarity searches on image databases employing a user-produced sketch or a low-resolution scanned copy of a query image [C87]. A novel database organisation is also employed in order to accelerate queries over large image databases. Regarding this system, a few significant papers have been produced [C88, C89, C90] that report a similar use of wavelet-enabled search and offer variations in techniques in terms of transform and indexing. However, this system is representative of the general technique and highlights the use of an interactive and fast user interface system [C91].

WebSeek [C94] is a video and image cataloguing and retrieval system for the world-wide web developed at Columbia University. It automatically collects imaging data from the web and semi-automatically populates its database using an extendible subject taxonomy. Text and simple content based features (colour histograms) are used to index the data to facilitate an iterative and interactive query method based on a Java and HTML search engine.

VisualSeek [C93] was also produced by Columbia University and appears to be an extension to WebSeek. It provides distinctly different functionality. Firstly it has not been specifically intended as a web-image search engine. Secondly it segments images to enable local and spatial queries. Segmentation uses the back-projection of binary colour sets. This technique is used not only for the extraction of colour regions but also for their representation.

Other related systems are ImageRover [C96] from Boston University and WebSeer [C95] from the University of Chicago. ImageRover offers very similar functionality, with a QBIC-type query system offering combined relevance / non-relevance queries. WebSeer differs in that it tries to automatically distinguish between photographs and graphics and catalogues only what it considers to be photos.

MIT’s Photobook [C97] proposes a different solution to the problem of image retrieval from many of the other systems described in this review. It centres its attention not on purely discriminant indices but instead on “semantic-preserving image compression”. This means that the images are not accompanied by purely discriminatory based index coefficients (e.g. colour histograms) but by compressed versions of the image. These compressed versions of the images are not compressed in terms of coding efficiency (in a rate-distortion sense) but instead in terms of semantic information preservation.

Regarding query techniques and user interfaces, most of them start with the user selecting the category of image they wish to examine (e.g. tools, cloth samples, etc.). These categories are constructed from the text annotations using an AI database called Framer [C100]. Further query functionalities are: the entire set of images is sorted in terms of similarity to a query image and the most similar subset is displayed as a results list; a single image or combinations of images can be used as the next similarity query; and a selection of matching models can be used [C104, C105, C106, C107].

Videobook [C98] was also produced at MIT and is an extension of the Photobook system to video, although it shares little functionality with Photobook. Instead, it segments videos into 64x64x16 blocks, from which 8-dimensional feature vectors are extracted. This vector comprises motion, texture and colour information and is used to index videos during database population. A chosen block is used as a query, and a Euclidean distance is used to find similar blocks in the database. The VideoQ [C101] system uses eigenvalue analysis to define shape.

Virage [C102, C103] is a commercial organisation that has produced products for content-based image and video retrieval. These applications are based on the "Open framework" systems called the Virage Image Engine and the Virage Video Engine. These systems are produced in the form of statically and dynamically linkable libraries, accessible by predefined functions, produced for a range of platforms. They have a base functionality (e.g. texture and colour primitives) that can be extended by developers to enable them to "plug in" to various systems. Domain-specific retrieval applications can then be produced.

The ISLIP / Infomedia project at Carnegie Mellon University has been set up to create a large, online digital library featuring full content and knowledge-based search and retrieval [C108]. This system is available commercially under the name ISLIP.

ViBE [C110] is a video retrieval system with a novel user interface developed at Purdue University. In this system MPEG-2 video sequences are used and temporally segmented using compressed-domain techniques. Shot detection is performed by generating a 12-dimensional feature-vector termed a General Trace. This is derived from the DC coefficients from the DCT blocks formed into a reduced resolution DC image. Half of the 12 features are statistics from these DC images and the other half is from motion information and MPEG structural information. The L2 norm together with a trained classification / regression tree is used for the shot segmentation.

11. Other commercial content-based image retrieval systems

Query By Image Content (QBIC) [G1] is an image retrieval system developed by IBM. It was one of the first systems to perform image retrieval by considering the visual content of images rather than textual annotations. QBIC supports queries based on example images, user-constructed sketches, and selected colors and texture patterns. In its most recent version, it allows text-based keyword search to be combined with content-based similarity search. It is commercially available as a component of the IBM DB2 Database System. QBIC is currently in use by the Hermitage Museum, for its online gallery [G2].

Cobion's Premier Content Technology [G3] is a family of software tools targeted at facilitating the retrieval of any kind of content and information. The Content Analysis Library is a member of this family that can be used for indexing and retrieval of images and videos based both on keywords and image visual contents. Core technologies used by Cobion for these purposes include recognition of logos, trademarks, signs and symbols, and image categorization. Cobion products and services are used for providing hosted services to portals [G4, G5, G6], destination sites, corporations and image galleries.

Virage [G7] is a commercial content-based still-image search engine developed at Virage, Inc. Virage supports queries based on color, color layout, texture, and structure (object boundary information) in any combination. The users select the weight values to be associated with each atomic feature according to their own emphasis.

Convera’s Screening Room [G8] is a software tool, which provides video producers with the capability to browse, search and preview all of their video source material. Among its features are storyboard browsing, content cataloguing using annotations and search for precise video clips using text and image clues.

Perception-Based Image Retrieval (PBIR) by Morpho Software [G9] is another image and video search engine. Morpho's PBIR uses perception-based analysis to break each image of the given collection down to more than 150 parameters. Subsequently, users can select the images - either provided by the system or by the users themselves - that best suit their preferences; the system uses their positive and negative feedback to learn what it is they are looking for. By analyzing the characteristics of both selected and unselected images, PBIR infers the user's intention. PBIR-based modules can be integrated into existing databases, web crawlers and text-based search engines.

Evision's eVe [G10] is a visual search engine that uses object segmentation to extract indexing features for each object and to allow for object-based functionality, and it features a common API for handling still images, video, and audio. A visual vocabulary is automatically generated for a given image collection; queries can then be submitted with the help of this vocabulary. Color, texture and shape features are employed for evaluating the similarity between the members of the vocabulary used for initiating a query and the images of the collection.

LTU Technologies [G11] offers two distinct products for categorizing and indexing images and videos: Image-Indexer and Image-Seeker. Both products rely on analyzing an image or video dataflow based on its visual features (shape, color, texture, etc) and subsequently extracting a digital signature of the image or video; this is used either for performing visual queries and refining them with the assistance of relevance feedback (Image-Seeker), or for automatically categorizing the image based on its contents and extrapolating keywords that can be used for querying (Image-Indexer). Major users of LTU Technologies’ image indexing products include the French Patent Office and Corbis [G12].

Lanterna Magica [G13] specializes in video indexing and retrieval. Indexor is used to describe the content of a video production, i.e. divide a video into a set of meaningful segments (whether stories, scenes, shots or groups of frames) and describe each video segment with a textual description and extracted representative images. Diggor is used to search an indexed video production. It is a search engine specialised for video; it retrieves the video segments that match the search criteria, and displays their textual descriptions and representative images.

MediaArchive, from Tecmath [G14], is a powerful archiving system allowing the storage and retrieval of any media files. This application is an industrial application of the EU project EUROMEDIA. It is in use at many major broadcasters in Europe.

Media Gateway Suite, from Pictron [G15], features automatic detection of video scene changes and extraction of representative key frames. It enables the summarization of the video content into a storyboard format that communicates the storyline in a highly effective way. Image features such as color, shape, and texture and object features such as human faces, video title text, and user-defined objects are extracted from video frames to index the visual content.

Almedia Gateway and Almedia Publisher are two content management tools produced by Aliope Ltd [G16]. Almedia Gateway automatically encodes and analyses video broadcasts in either live or off-line modes performing both shot boundary detection and keyframe extraction. Almedia Publisher edits and reformats video into segments that are automatically transcoded for delivery over the web and to wireless devices. It also enables multi-level semi-automatic annotation of video in terms of scenes, stories, people, etc.

Acknowledgement: Some parts of this document were produced using the literature review by B. Levienaise-Obadia on Video Database Retrieval, Technical Report T8/02/1, University of Surrey. Specifically, the literature review labeled as 'References A' was taken from that report.

12. References

12.1 References A

[A1] B. Agnew, C. Faloutsos, Z. Wang, D. Welch, and X. Xue. Multimedia indexing over the web. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 72—83, 1997.

[A2] G. Ahanger and T.D.C. Little. A survey of technologies for parsing and indexing digital videos. Journal of Visual Communication and Image Representation, 7:28—43, March 1996.

[A3] E. Ardizzone, G.A.M. Gioiello, M. La Cascia, and D. Molinelli. A real-time neural approach to scene cut detection. In SPIE Storage and Retrieval for Image and Video Databases IV, 1996.

[A4] F. Arman, A. Hsu, and M.Y. Chiu. Feature management for large video databases. In SPIE: Storage and retrieval of image and video databases, 1993.

[A5] T. Arndt and S.-K. Chang. Image sequence compression by iconic indexing. In IEEE Computer Society IEEE, editor, IEEE Workshop on Visual Languages, pages 177—182, 1989.

[A6] V. Athitsos and M.J. Swain. Distinguishing photographs and graphics on the world wide web. In IEEE, editor, IEEE Workshop on Content-Based Access of Image and Video Libraries, 1997.

[A7] J.R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey, R. Jain, and C.F. Shu. The Virage search engine: An open framework for image management. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 77—87, 1996.

[A8] F. Beaver. Dictionary of Film Terms. Twayne Publishing, New York, 1994.

[A9] S. Berretti, A. Del Bimbo, and P. Pala. Sensation and psychological effects in color image database. In ICIP’97, pages 560—562, 1997.

[A10] M. Blume and D.R. Ballard. Image annotation based on learning vector quantisation and localised haar wavelet transform features. Technical report, Reticular Systems, Inc., 1997.

[A11] A.S. Bruckman. Electronic scrapbook: Towards an intelligent home-video editing system. Master’s thesis, MIT, 1991.

[A12] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In CVPR '97 Workshop on Content-Based Access of Image and Video Libraries, 1997.

[A13] M. La Cascia and E. Ardizzone. Jacob: Just a content-based query system for video databases. In ICASSP’96, 1996.

[A14] C.-W. Chang and S.-Y. Lin. Video content representation, indexing and matching in video information systems. Journal of Visual Communication and Image Representation, 8, No.2:107—120, June 1997.

[A15] S.-K. Chang, Q.-Y. Shi, and C.-W. Yan. Iconic indexing by 2D strings. IEEE Trans. Patt. Anal. Mach. Intell., PAMI-9(3):413—428, May 1987.

[A16] S.F. Chang, W. Chen, H.J. Meng, H. Sundaram, and D. Zhong. VideoQ: An automated content based video search system using visual cues. In ACM Multimedia, 1997.

[A17] M.G. Christel. Addressing the contents of video in a digital library. Electronic Proceedings of the ACM Workshop on Effective Abstractions in Multimedia, 1995.

[A18] M.G. Christel, D.B. Winkler, and C. Roy Taylor. Multimedia abstractions for a digital library. In ACM Digital Libraries ‘97 Conference, 1997.

[A19] J.M. Corridoni and A. Del Bimbo. Structured digital video indexing. In ICPR ‘96, pages 125—129, 1996.

[A20] Y. Deng and B.S. Manjunath. Content-based search of video using color, texture and motion. In ICIP’97, Vol.1, pages 534—537, 1997.

[A21] Y. Deng, D. Mukherjee, and B.S. Manjunath. Netra-v: Towards an object-based video representation. Technical report, University of California Santa Barbara, 1998.

[A22] N. Dimitrova and M. Abdel-Mottaleb. Content-based video retrieval by example video clip. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 59—70, 1997.

[A23] J.P. Eakins, K. Shields, and J. Boardman. Artisan - a shape retrieval system based on boundary family indexing. In SPIE Storage and Retrieval for Image and Video Databases IV - Vol.2670, pages 123—135, 1996.

[A24] D.A. Forsyth, J. Malik, M.M. Fleck, H. Greenspan, T. Leung, S. Belongie, C. Carson, and C. Bregler. Finding pictures of objects in large collections of images. Technical report, University of California, Berkeley, 1996.

[A25] T. Gevers and A.W.M. Smeulders. Color-metric pattern-card matching for viewpoint invariant image retrieval. In ICPR ‘96, pages 3—7, 1996.

[A26] Y. Gong, H. Zhang, H.C. Chuan, and M. Sakauchi. An image database system with content capturing and fast image indexing abilities. In Int. Conf. on Multimedia Computing and Systems, pages 121—130, 1994.

[A27] K. Hachimura. Retrieval of paintings using principal color information. In ICPR ‘96, pages 130—134, 1996.

[A28] A. Hampapur, A. Gupta, B. Horowitz, C.F. Shu, C. Fuller, J. Bach, M. Gorkani, and R. Jain. Virage video engine. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 188—198, 1997.

[A29] K.J. Han and A.H. Tewfik. Eigen-image based video segmentation and indexing. In ICIP’97, Vol.1, pages 538—541, 1997.

[A30] F. Idris and S. Panchanathan. Indexing of compressed video sequences. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 247— 253, 1996.

[A31] F. Idris and S. Panchanathan. Review of image and video indexing techniques. Journal of Visual Communication and Image Representation, 8, No.2:107—120, June 1997.

[A32] M. Irani and P. Anandan. Video indexing based on mosaic representations. In Proceedings of IEEE, to appear, 1997.

[A33] M. Irani, S. Hsu, and P. Anandan. Mosaic-based video compression. In SPIE Vol.2419, pages 242—253, 1998.

[A34] G. Iyengar and A.B. Lippman. Videobook: An experiment in characterization of video. In ICIP’96, Vol.3, pages 855—858, 1996.

[A35] C.E. Jacobs, A. Finkelstein, and D.H. Salesin. Fast multiresolution image querying. In SIGGRAPH 95, 1995.

[A36] A. Jain and G. Healey. Evaluating multiscale opponent colour features using Gabor filters. In ICIP’97, Vol.II, pages 203—206, 1997.

[A37] A.K. Jain. Fundamentals of Digital Image Processing. Prentice Hall, 1989.

[A38] R. Jain, A. Pentland, and D. Petkovic. Workshop report: NSF-ARPA workshop on visual information management systems. WWW, June 1995.

[A39] J.P. Kelly and M. Cannon. Query by image example: the candid approach. In SPIE Storage and Retrieval for Image and Video Databases III - Vol.2420, pages 238—248, 1995.

[A40] J. Kreyss, M. Roeper, P. Alshuth, T. Hermes, and O. Herzog. Video retrieval by still image analysis with ImageMiner. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 36—44, 1997.

[A41] J. Lee and B.W. Dickinson. Multiresolution video indexing for subband coded video databases. In SPIE Retrieval and Storage of Image and Video Databases, Vol.2185, pages 162—173, 1994.

[A42] Y. Li, B. Tao, S. Kei, and W. Wolf. Semantic image retrieval through subject segmentation and characterization. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 340—351, 1997.

[A43] K.C. Liang, X. Wan, and C.-C. Jay Kuo. Indexing, retrieval and browsing of wavelet compressed imagery data. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 506—517, 1997.

[A44] F. Liu. Modelling Spatial and Temporal Textures. PhD thesis, MIT MediaLab, 1996.

[A45] W.Y. Ma, Y. Deng, and B.S. Manjunath. Tools for texture/color based search of images. In SPIE, Vol.3106, 1997.

[A46] W.Y. Ma and B.S. Manjunath. Netra: A toolbox for navigating large image databases. In ICIP’97, Vol. II, pages 568—571, 1997.

[A47] J. Malik, D.A. Forsyth, M.M. Fleck, H. Greenspan, T. Leung, C. Carson, S. Belongie, and C. Bregler. Finding objects in image databases by grouping. In ICIP96, Vol.2, pages 761—764, 1996.

[A48] M.K. Mandal, S. Panchanathan, and T. Aboulnasr. Image indexing using translation and scale-invariant moments and wavelets. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 380—389, 1997.

[A49] B.S. Manjunath and W.Y. Ma. Browsing large satellite and aerial photographs. In ICIP’96, Vol.2, pages 765—768, 1996.

[A50] S. Mann and R.W. Picard. Virtual bellows: constructing high quality stills from video. In First IEEE International Conference on Image Processing, 1994.

[A51] J. Mao and A.K. Jain. Texture classification and segmentation using multiresolution simultaneous autoregressive models. Pattern Recognition, 25, No.2:173—188, 1992.

[A52] J. Meng and S.-F. Chang. Tools for compressed-domain video indexing and editing. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 180—191, 1997.

[A53] K. Messer and J. Kittler. Selecting features for neural networks to aid an iconic search through an image database. In IEE 6th International Conference on Image Processing and Its Applications, pages 428—432, 1997.

[A54] K. Messer and J. Kittler. Using feature selection to aid an iconic search through an image database. In ICASSP’97, Vol.4, 1997.

[A55] T.P. Minka and R. Picard. Interactive learning with a ‘society of models’. In CVPR, pages 447—452, 1996.

[A56] H. Mo, S. Satoh, and M. Sakauchi. A study of image recognition using similarity retrieval. In First International Conference on Visual Information Systems (Visual’96), pages 136—141, 1996.

[A57] F. Mokhtarian, S. Abbasi, and J. Kittler. Efficient and robust retrieval by shape through curvature scale space. In Proceedings of the First International Workshop on Image Databases and Multi-Media Search, pages 35—42, Aug 1996.

[A58] J. Monaco. How to read a film: the art, technology, language, and theory of film and media. Oxford University Press, 1977.

[A59] W. Niblack, R. Barber, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. The QBIC project: Querying images by content using colour, texture, and shape. In SPIE, pages 173—187, 1993.

[A60] A. Pentland, R.W. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. Intern. J. Comput. Vision, 18(3):233—254, 1996.

[A61] R.W. Picard. Content access for image/video coding: “the fourth criterion”. Statement for Panel on “Computer Vision and Image/Video Compression”, ICPR94, 1994.

[A62] R.W. Picard. Light-years from lena: Video and image libraries of the future. In ICIP’95, 1995.

[A63] R.W. Picard. A society of models for video and image libraries. IBM Systems Journal, 35, No.3 and 4, 1996.

[A64] T. Randen and J.H. Husøy. Image content search by color and texture properties. In ICIP’97, Vol.II, pages 580—583, 1997.

[A65] A. Ravishankar Rao, N. Bhushan, and G.L. Lohse. The relationship between texture terms and texture images: A study in human texture perception. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 206—214, 1996.

[A66] R. Reeves, K. Kubik, and W. Osberger. Texture characterization of compressed aerial images using dct coefficients. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 398—407, 1997.

[A67] R. Rickman and J. Stonham. Content-based image retrieval using colour tuple histograms. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 2—7, 1996.

[A68] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 203—208, 1996.

[A69] E. Saber and A. Murat-Tekalp. Integration of color, shape, and texture for image annotation and retrieval. In ICIP’96, Vol.?, pages 851—854, 1996.

[A70] E. Saber and A. Murat-Tekalp. Region-based image annotation using colour and texture cues. In EUSIPCO-96, Vol.3, pages 1689—1692, 1996.

[A71] E. Saber and A. Murat-Tekalp. Region-based shape matching for automatic image annotation and query-by-example. Journal of Visual Communication and Image Representation, 8, No.1:3—20, March 1997.

[A72] W. Sack and M. Davis. IDIC: Assembling video sequences from story plans and content annotations. In Int. Conf. on Multimedia Computing and Systems 94, pages 30—36, 1994.

[A73] S. Santini and R. Jain. Similarity matching. Submitted to: IEEE Trans. on Pattern Analysis and Machine Intelligence, 1996.

[A74] S. Santini and R. Jain. Similarity queries in image databases. In CVPR ‘96, pages 646—651, 1996.

[A75] S. Santini and R. Jain. Do images mean anything? In ICIP’97, pages 564—567, 1997.

[A76] D.D. Saur, Y.-P. Tan, S.R. Kulkarni, and P.J. Ramadge. Automated analysis and annotation of basketball video. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 176—187, 1997.

[A77] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1997.

[A78] C. Schmid, R. Mohr, and C. Bauckhage. Comparing and evaluating interest points. In ICCV’98, 1998.

[A79] S. Sclaroff, L. Taycher, and M. La Cascia. ImageRover: A content-based image browser for the world wide web. In IEEE Workshop on Content-based Access of Image and Video Libraries, 1997.

[A80] W.B. Seales, C.J. Yuan, W. Hu, and M.D. Cutts. Content analysis of compressed video. Technical report, University of Kentucky, 1996.

[A81] K. Shearer, S. Venkatesh, and D. Kieronska. Spatial indexing for video databases. Journal of Visual Communication and Image Representation, 4:325—335, December 1996.

[A82] J.R. Smith and S.-F. Chang. Local color and texture extraction and spatial query. In ICIP’96, Vol.3, pages 1011—1014, 1996.

[A83] J.R. Smith and S.-F. Chang. Searching for images and videos on the world-wide web. Technical report, Columbia University, 1996.

[A84] M.A. Smith and T. Kanade. Video skimming and characterization through the combination of image and language understanding techniques. Technical report, Carnegie Mellon University, 1997.

[A85] H.S. Stone. Image matching by means of intensity and texture matching in the Fourier domain. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 337—349, 1996.

[A86] M. Stricker and A. Dimai. Colour indexing with weak spatial constraints. In SPIE Retrieval and Storage of Image and Video Databases, 1996.

[A87] M. Stricker and M. Orengo. Similarity of colour images. In SPIE Retrieval and Storage of Image and Video Databases, 1995.

[A88] M.A. Stricker. Bounds for the discrimination power of color indexing techniques. In SPIE Retrieval and Storage of Image and Video databases, Vol. 2185, pages 15—24, 1994.

[A89] M.J. Swain and D.H. Ballard. Color indexing. Intern. J. Comput. Vision, 7(1):11—32, 1991.

[A90] M.J. Swain, C. Frankel, and V. Athitsos. Webseer: An image search engine for the world wide web. Technical report, University of Chicago, 1996.

[A91] D. Swanberg, C.-F. Shu, and R. Jain. Knowledge guided parsing in video databases. In SPIE Vol.1908, pages 13—21, 1993.

[A92] B. Tao and B. Dickinson. Template-based image retrieval. In ICIP’96, Vol.3, pages 781—874, 1996.

[A93] P.H.S. Torr, A. Zisserman, and D.W. Murray. Motion clustering using the trilinear constraint over three views. In Europe-China Workshop on Geometrical Modelling and Invariants for Computer Vision, pages 118—125.

[A94] A. Tversky. Features of similarity. Psychological Review, 84(4):327—352, July 1977.

[A95] N. Vasconcelos and A. Lippman. Towards semantically meaningful feature spaces for the characterization of video content. In ICIP’97, Vol.1, pages 25—28, 1997.

[A96] X. Wan and C.-C. Jay Kuo. Colour distribution analysis and quantization for image retrieval. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 8—16, 1996.

[A97] J. Ze Wang, G. Wiederhold, O. Firschein, and S.X. Wei. Applying wavelets in image database retrieval. Technical report, Stanford University, 1996.

[A98] J. Ze Wang, G. Wiederhold, O. Firschein, and S.X. Wei. Wavelet-based image indexing techniques with partial sketch retrieval capability. In Proc. of the Fourth Forum on Research and Technology Advances in Digital Libraries, 1997.

[A99] D.A. White and R. Jain. Similarity indexing: algorithms and performance. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 72—73, 1996.

[A100] D.A. White and R. Jain. ImageGrep: Fast visual pattern matching in image databases. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 96—107, 1997.

[A101] M. Wood, N. Campbell, and B.T. Thomas. Employing region features for searching an image database. In BMVC’97, pages 620—629, 1997.

[A102] W. Xiong, R. Ma, and J.C.-M. Lee. Novel technique for automatic key frame computing. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 166—173, 1997.

[A103] M.M. Yeung and B.-L. Yeo. Time-constrained clustering for segmentation of video into story units. In ICPR ‘96, Vol.3, pages 375—380, 1996.

[A104] M.M. Yeung and B.-L. Yeo. Video content characterization and compaction for digital library applications. In SPIE Storage and Retrieval for Still Image and Video Databases V, Vol.3022, pages 45—58, 1997.

[A105] M.M. Yeung, B.-L. Yeo, W. Wolf, and B. Liu. Video browsing using clustering and scene transitions on compressed sequences. In SPIE Multimedia Computing and Networking 1995, 1995.

[A106] H. Zhang, Y. Gong, S.W. Smoliar, and S.Y. Tan. Automatic parsing of news video. In International Conference on Multimedia Computing and Systems, pages 45—54, 1994.

[A107] H. Zhang, J.Y. Wang, and Y. Altunbasak. Content-based video retrieval and compression: A unified solution. In ICIP’97, pages 13—16, 1997.

[A108] D. Zhong, H.J. Zhang, and S.-F. Chang. Clustering methods for video browsing and annotation. In SPIE Storage and Retrieval for Still Image and Video Databases IV, Vol.2670, pages 239—246, 1996.

[A109] H.J. Zhang and S. Smoliar. Developing power tools for video indexing and retrieval. In SPIE Vol.2185, pages 140—149, 1994.

12.2 References B

[B1] H. J. Zhang, A. Kankanhalli, & S. W. Smoliar, “Automatic Partitioning of Full-motion Video,” ACM Multimedia System, Vol. 1, No. 1, pp. 10-28, 1993.

[B2] N. V. Patel, I. K. Sethi, “Compressed Video Processing for Cut Detection,” IEE Proc. Visual Image Signal Process, vol. 143, no. 5, pp. 315-23, Oct 1996.

[B3] B.L. Yeo and B. Liu, “On the Extraction of DC Sequence from MPEG Compressed Video,” IEEE Int. Conf. on Image Processing, vol. 2, pp. 260-30, Oct 1995.

[B4] E. Deardorff, T. D. C. Little, J. D. Marshall, D. Venkatesh & R. Walzer, “Video Scene Decomposition with the Motion Picture Parser,” SPIE Conf. Digital Video Compression on Personal Computers: Algorithms and Technologies, Vol. 2187, pp. 44-55, 1994.

[B5] Y. Deng and B.S. Manjunath, “Content-based Search of Video using Color, Texture, and Motion”, Proc. of IEEE Intl. Conf. on Image Processing, vol. 2, pp 534-537, 1997.

[B6] R. Zabih, J. Miller & K. Mai, “A Feature-Based Algorithm for Detecting and Classifying Scene Breaks,” Proc. ACM Intl. Conf. Multimedia’95, pp. 189-200, Nov 1995.

[B7] Bo Shen, Dongge Li & Ishwar K. Sethi, “HDH Based Compressed Video Cut Detection,” Second Intl. Conf. on Visual Information Systems, pp. 149-156, Dec 1997.

[B8] M. Yeung, B. L. Yeo & B. Liu, “Extracting Story Units from Long Programs for Video Browsing and Navigation,” In Proc. IEEE Conf. on Multimedia Computing and Systems, 1996.

[B9] Y. Rui, T. S. Huang & S. Mehrotra, “Exploring Video Structure Beyond the Shots,” Proc. of IEEE Intl. Conf. on Multimedia Computing and Systems (ICMCS), June 28-July 1, 1998.

[B10] J. Meng & S.-F. Chang, “Tools for Compressed-Domain Video Indexing and Editing,” Proc. SPIE Storage and Retrieval for Image and Video Database IV, vol. 2670, pp. 180-91, Feb 1996.

[B11] K. J. Han & A. H. Tewfik, “Eigen-Image Based Video Segmentation and Indexing,” IEEE Intl. Conf. on Image Processing, ICIP’97, Oct 1997.

[B12] W. Xiong, C. M. Lee, R. H. Ma, “Automatic video data structuring through shot partitioning and key-frame computing,” Machine Vision and Applications, vol.10, no.2, pp. 51-65, 1997.

[B13] P. O. Gresle & T. S. Huang, “Gisting of Video Documents: A Key Frames Selection Algorithm Using Relative Activity Measure,” The 2nd Int. Conf. on Visual Information Systems, pp. 279-86, 1997.

[B14] R. Bolle, Y. Aloimonos, & C. Fermuller, “Toward Motion Picture Grammars,” Third Asian Conf. on Computer Vision, ACCV’98, vol. 2, pp. 283-290, 1998.

[B15] M. M. Yeung, B. L. Yeo, “Video Content Characterization and Compaction for Digital Library Applications,” SPIE Conf. Storage & Retrieval for Image and Video Databases V, pp. 45-58, 1997.

[B16] F. Pereira, “MPEG-7: A Standard for Content-Based Audiovisual Description,” Second Intl. Conf. on Visual Information Systems, pp. 1-4, Dec 1997.

[B17] B. L. Yeo & M. M. Yeung, “Classification, Simplification and Dynamic Visualization of Scene Transition Graphs for Video Browsing,” SPIE Conf. Storage & Retrieval for Image and Video Database VI, pp. 60-70, 1998.

[B18] Michael A. Smith & Takeo Kanade, “Video Skimming and Characterization Through the Combination of Image and Language Understanding Techniques,” IEEE Conf. Computer Vision and Pattern Recognition, CVPR, pp. 775-781, June 1997.

[B19] J. R. Smith and S. F. Chang, “Visually Searching the Web for Content,” IEEE Multimedia Magazine, Vol. 4, No. 3, pp. 12-20, Summer 1997.

[B20] S. Sclaroff, L. Taycher, & M. La Cascia, “ImageRover: A Content-Based Image Browser for the World Wide Web,” Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 2-9, 1997.

[B21] C. W. Ngo, T. C. Pong & R. T. Chin, “Exploiting Image Indexing Techniques in DCT Domain,” IAPR International Workshop on Multimedia Information Analysis and Retrieval, to appear, 1998.

[B22] Y. Rui, T. S. Huang & S. Mehrotra, “Relevance Feedback Techniques in Interactive Content-Based Image Retrieval,” Proc. SPIE Storage and Retrieval for Still Image and Video Database VI, vol. 3312, pp. 25-36, 1998.

[B23] Y. S. Hsu, S. Prum, J. H. Kagel, and H. C. Andrews, “Pattern Recognition experiments in the Mandala/cosine domain,” IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 512-520, Sept, 1983.

[B24] K. C. Liang, X. Wan, C. C. J. Kuo, “Indexing, Retrieval, and Browsing of Wavelet Compressed Imagery Data,” SPIE Conf. Storage & Retrieval for Image and Video Databases V, pp. 506-517, 1997.

[B25] Janko Calic and E. Izquierdo, “Temporal Segmentation of MPEG Video Streams”, EURASIP Journal on Applied Signal Processing, special issue on Image Analysis for Multimedia Interactive Services, Part II, Jun., 2002.

[B26] J. Calic, S. Sav, E. Izquierdo, S. Marlow, N. Murphy and N.E. O’Connor, "Temporal Video Segmentation for Real-Time Key Frame Extraction", Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'2002, Orlando, Florida, May 2002, 4 pages

[B27] J. Calic and E. Izquierdo, "Efficient Key-Frame Extraction and Video Analysis", Proc. of IEEE ITCC 2002, Las Vegas, Nevada, Apr. 2002.

[B28] J. Calic and E. Izquierdo, "A Multiresolution Technique for Video Indexing and Retrieval", submitted to IEEE Int. Conf. On Image Processing, ICIP2002, Rochester, New York, Sep. 2002

[B29] A Survey of Technologies for Parsing and Indexing Digital Video, Boston University, http://hulk.bu.edu/pubs/papers/1995/ahanger-jvcir95/TR-11-01-95.html

[B30] Arun Hampapur, Ramesh Jain and Terry E Weymouth, "Feature Based Digital Video Indexing"

[B31] Stephen W. Smoliar, HongJiang Zhang, and Jian Hua Wu. "Using frame technology to manage video." In Proc. of the Workshop on Indexing and Reuse in Multimedia Systems. American Association of Artificial Intelligence, August 1994.

[B32] Deborah Swanberg, Chiao-Fe Shu, and Ramesh Jain. "Architecture of a multimedia information system for content-based retrieval." In Audio Video Workshop, San Diego, California, November 1992.

[B33] Deborah Swanberg, Chiao-Fe Shu, and Ramesh Jain. "Knowledge guided parsing in video databases." Electronic Imaging: Science and Technology, San Jose, California, February 1993. IS&T/SPIE.

[B34] Marc Davis. "Media Streams: An iconic visual language for video annotation." In IEEE Symposium on Visual Languages, pp. 196-202. IEEE Computer Society, 1993.

[B35] Marc Davis. "Knowledge representation for video." In Working Notes: Workshop on Indexing and Reuse in Multimedia Systems, pp. 19-28. American Association of Artificial Intelligence, August 1994.

[B36] Ramesh Jain and Arun Hampapur. "Metadata in video databases" In Sigmod Record: Special Issue On Metadata For Digital Media. ACM:SIGMOD, December 1994.

[B37] Informedia: News-on-demand Multimedia Information Acquisition and Retrieval, Intelligent Multimedia Information Retrieval, Mark T. Maybury, Ed., AAAI Press, pp. 213-239, 1997.

[B38] Multimedia Summaries of Broadcast News, Mark Maybury, Intelligent Information Systems, 1997.

[B39] Rainer Lienhart, Silvia Pfeiffer, and Wolfgang Effelsberg, "Video Abstracting", Communications of the ACM

[B40] E. Izquierdo and M. Ghanbari, "Key Components for an Advanced Segmentation Toolbox", IEEE Transactions on Multimedia, Vol. 4, No. 1, Mar. 2002.

[B41] L. Alvarez, P. L. Lions and J. M. Morel, “Image Selective Smoothing and Edge Detection by Nonlinear Diffusion. II“, SIAM J. Numer. Anal., Vol. 29, No. 3, 1992, pp. 845-866.

[B42] H. H. Baker and T. O. Binford, “Depth from edge and intensity based stereo“, Proc. 7th Int. Joint conf. Artificial Intell., Vancouver, Canada, Aug. 1981, pp. 631-636.

[B43] S. Beucher and F. Meyer, “The morphological approach to segmentation: The watershed transformation“, in Mathematical Morphology in Image Processing (E. R. Dougherty, Ed.), Marcel Dekker, New York, 1993, pp. 433-481.

[B44] G. Borshukov, G. Bozdagi, Y. Altunbasak and M. Tekalp, “Motion Segmentation by Multistage Affine Classification“, IEEE Transaction on Image Processing, vol. 6, no. 11, Nov. 1997, pp. 1591-1594.

[B45] S. Boukharouba, J. M. Rebordao and P. L. Wendel, “An amplitude segmentation method based on the distribution function of an image“, Computer Vision, Graphics, and Image Processing, vol. 29, 1985, pp. 47-59.

[B46] F. Catté, F. Dibos and G. Koepfler, “A Morphological Scheme for Mean Curvature Motion and Applications to Anisotropic Diffusion and Motion of Level Sets“, SIAM J. Numer. Anal., Vol. 32, No. 6, 1995, pp. 1895-1909.

[B47] F. Catté, P. L. Lions, J. M. Morel and T. Coll, “Image Selective Smoothing and Edge Detection by Nonlinear Diffusion I“, SIAM J. Numer. Anal., Vol. 29, No. 1, 1992, pp. 182-193.

[B48] M. Chang, M. Tekalp, and I. Sezan, “Simultaneous Motion Estimation and Segmentation“, IEEE Transaction on Image Processing, vol. 6, no. 9, Sep. 1997, pp. 1326-1333.

[B49] D. De Vleesschauwer, F. Alaya Cheikh, R. Hamila, M. Gabbouj, “Watershed Segmentation of an Image Enhanced by Teager Energy Driven Diffusion“, Sixth Int. Conf. on Image Processing and its Applications, Jul. 1997, pp. 254-258.

[B50] D. De Vleesschauwer, P. De Smet, F. Alaya Cheikh, R. Hamila, M. Gabbouj, “Optimal Performance of the Watershed Segmentation on an Image Enhanced by Teager Energy Driven Diffusion“, Proc. of VLBV’98, Oct. 1998, pp. 137-140.

[B51] E. Francois and B. Chupeau, “Depth-based segmentation“, IEEE Transaction on Circuits and Systems for Video Technology, vol. 7, no. 1, Feb. 1997, pp. 237-239.

[B52] W. Hoff and N. Ahuja, “Surfaces from Stereo: Integrating Feature Matching, Disparity Estimation, and Contour Detection“, IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. PAMI-11, no. 2, 1989, pp. 121-136.

[B53] A. Ibenthal, S. Siggelkow, R. R. Grigat, “Image sequence segmentation for object-oriented coding“, Proc. of European Symposium on Advanced Imaging and Network Technologies, SPIE vol. 2952, Berlin, Germany, 1996, pp. 2-11.

[B54] E. Izquierdo, “Stereo matching for enhanced telepresence in 3D-videocommunications“, IEEE Transaction on Circuits and Systems for Video Technology, Special issue on Multimedia Technology, Systems and Applications, vol. 7, no. 4, Aug. 1997, pp. 629-643.

[B55] E. Izquierdo and S. Kruse, “Disparity Controlled Segmentation“, Proc. of Picture Coding Symposium 97, Berlin, Germany, 1997, pp. 737-742.

[B56] E. Izquierdo and M. Ghanbari, “Accurate Curve Matching for Object-Based Motion Estimation“, Electronics Letters, Oct. 1998.

[B57] A. Kalvin, E. Schonberg, J. T. Schwartz and M. Sharir, “Two dimensional model based boundary matching using footprints“, Int. J. Robotics Res., vol. 5, no. 4, 1986, pp. 38-55.

[B58] J. J. Koenderink, “The Structure of Images“, Biol. Cybernet. 50, 1984, pp. 363-370.

[B59] S. Kruse, “Scene segmentation from dense displacement vector fields using randomized Hough transform“, Signal Processing: Image Communication, Vol. 9, 1996, pp. 29-41.

[B60] F. Meyer and S. Beucher, “Morphological segmentation“, J. of Visual Communication and Image Representation 1, 1990, pp. 21-46.

[B61] J. R. Ohm and E. Izquierdo, “An object-based system for stereoscopic viewpoint synthesis“, IEEE Transaction on Circuits and Systems for Video Technology, Special issue on Multimedia Technology, Systems and Applications, Oct. 1997, pp. 801-811.

[B62] P. Perona and J. Malik, “Scale Space and Edge Detection Using Anisotropic Diffusion“, Proc. IEEE Comput. Soc. Workshop on Comput. Vision, 1987, pp. 16-22.

[B63] M. I. Sezan, “A peak detection algorithm and its application to histogram-based image data reduction“, Computer Vision, Graphics, and Image Processing, Vol. 49, 1990, pp. 36-51.

[B64] D. Tzovaras, N. Grammalidis and M. G. Strintzis, “Object-Based Coding of Stereo Image Sequences Using Joint 3-D Motion/Disparity Compensation“, IEEE Transaction on Circuits and Systems for Video Technology, Special issue on Multimedia Technology, Systems and Applications, vol. 7, no. 2, Apr. 1997, pp. 312-328.

[B65] L. Vincent and P. Soille, “Watersheds in digital spaces: An efficient algorithm based on immersion simulations“, IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 1991, pp. 583-598.

[B66] L. Vincent, “Morphological algorithms“, in Mathematical Morphology in Image Processing (E. R. Dougherty, Ed.), Marcel Dekker, New York, 1993, pp. 255-288.

[B67] A. P. Witkin, “Scale-Space Filtering“, Proc. IJCAI, Karlsruhe, 1983, pp. 1019-1021.

[B68] H. J. Wolfson, “On Curve Matching“, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-12, no. 5, pp. 483-489, 1990.

[B69] M. Wollborn and R. Mech, “Procedure for Objective Evaluation of VOP Generation Algorithms“, Doc. ISO/IEC JTC1/SC29/WG11 MPEG97/2704, Fribourg, Switzerland, Oct. 1997.

[B70] P. Salembier and F. Marques, “Region-Based Representations of Image and Video: Segmentation Tools for Multimedia Services”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, no. 8, pp. 1147-1169, Dec. 1999.

[B71] N.V. Boulgouris, I. Kompatsiaris, V. Mezaris, D. Simitopoulos and M.G. Strintzis, “Segmentation and Content-based Watermarking for Color Image and Image Region Indexing and Retrieval”, EURASIP Journal on Applied Signal Processing, April 2002.

[B72] P. Perona and J. Malik, “Scale Space and Edge-Detection Using Anisotropic Diffusion”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 629-639, July 1990.

[B73] J. A. Noble, “The effect of morphological filters on texture boundary localization”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 5, pp. 554-561, May 1996.

[B74] P. Soille and H. Talbot, “Directional morphological filtering”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1313-1329, Nov. 2001.

[B75] Chad Carson, Serge Belongie, Hayit Greenspan and Jitendra Malik, “Color- and Texture-Based Image Segmentation Using EM and Its Application to Image Querying and Classification”, IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear, 2002.

[B76] L. Shafarenko, M. Petrou and J. Kittler, “Histogram-based segmentation in a perceptually uniform color space”, IEEE Transactions on Image Processing, vol. 7, no. 9, pp. 1354-1358, Sept. 1998.

[B77] S. Liapis, E. Sifakis and G. Tziritas, “Color and/or Texture Segmentation using Deterministic Relaxation and Fast Marching Algorithms”, Intern. Conf. on Pattern Recognition, vol. 3, pp. 621-624, Sept. 2000.

[B78] M. Unser, “Texture classification and segmentation using wavelet frames”, IEEE Trans. on Image Processing, vol. 4, no. 11, pp. 1549-1560, Nov. 1995.

[B79] T. Chang and J. Kuo, “Texture analysis and classification with tree-structured wavelet transform,” IEEE Trans. Image Processing, vol. 2, pp. 429-441, Oct. 1993.

[B80] E. Reusens, “Joint optimization of representation model and frame segmentation for generic video compression”, EURASIP Signal Processing, 46(11):105-117, September 1995.

[B81] X. Wu, “Adaptive split-and-merge segmentation based on piecewise least-square approximation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no 8, pp. 808-815, Aug. 1993.

[B82] H. S. Yang and S. U. Lee, “Split-and-merge segmentation employing thresholding technique”, in Proceedings International Conference on Image Processing, 1997, Volume: 1, pp. 239-242.

[B83] L. Vincent and P. Soille, “Watersheds in Digital Spaces: an efficient algorithm based on immersion simulations”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, pp. 583-598, June 1991.

[B84] S. Beucher and F. Meyer, “The morphological approach to segmentation: The watershed transformation”, Mathematical Morphology in Image Processing, Marcel Dekker, New York, pp.433-481, 1993.

[B85] K. Haris, S. N. Efstratiadis, N. Maglaveras and A. K. Katsaggelos, “Hybrid image segmentation using watersheds and fast region merging”, IEEE Transactions on Image Processing, vol. 7, no 12, pp. 1684-1699, Dec. 1998.

[B86] J. M. Gauch, “Image segmentation and analysis via multiscale gradient watershed hierarchies”, IEEE Transactions on Image Processing, vol. 8, no 1, pp. 69-79, Jan. 1999.

[B87] Hai Gao and Wan-Chi Siu and Chao-Huan Hou, “Improved techniques for automatic image segmentation”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 12, pp. 1273-1280, Dec. 2001.

[B88] S. Beucher, “Watershed, hierarchical segmentation and waterfall algorithm”, Mathematical Morphology and its Applications to Image Processing, Boston, MA, Kluwer, 1994, pp. 69-76.

[B89] L. Shafarenko, M. Petrou and J. Kittler, “Automatic watershed segmentation of randomly textured color images”, IEEE Transactions on Image Processing, vol. 6, no. 11, pp. 1530-1544, Nov. 1997.

[B90] I. Kompatsiaris and M. G. Strintzis, “Spatiotemporal Segmentation and Tracking of Objects for Visualization of Videoconference Image Sequences”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 10, no. 8, Dec. 2000.

[B91] J. Canny, “Computational approach to edge detection”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, pp. 679-698, Nov. 1986.

[B92] P. L. Palmer, H. Dabis and J. Kittler, “A performance measure for boundary detection algorithms”, Comput. Vis. Image Understanding, vol. 63, pp. 476-494, 1996.

[B93] L. H. Staib and J. S. Duncan, “Boundary Finding With Parametric Deformable Models”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, pp. 161-175, 1992.

[B94] M. Kass, A. Witkin and D. Terzopoulos, “Snakes: Active Contour Models”, Int. Journal Comput. Vis., vol. 1, pp. 321-331, 1988.

[B95] T.F. Chan and L.A. Vese, “Active contours without edges”, IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 266 –277, Feb. 2001.

[B96] Wei-Ying Ma and B.S. Manjunath, “EdgeFlow: a technique for boundary detection and image segmentation”, IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1375-1388, Aug. 2000.

[B97] N. Giordana and W. Pieczynski, “Unsupervised segmentation of multisensor images using generalized hidden Markov chains”, Proceedings International Conference on Image Processing, 1996, Volume: 3 pp. 987-990.

[B98] L. Fouque, A. Appriou and W. Pieczynski, “Multiresolution hidden Markov chain model and unsupervised image segmentation”, Proceedings 4th IEEE Southwest Symposium Image Analysis and Interpretation, 2000, pp. 121-125.

[B99] Zhuowen Tu and Song-Chun Zhu, “Image segmentation by data-driven markov chain monte carlo”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 657 –673, May 2002.

[B100] L. Lucchese and S.K. Mitra, “Colour segmentation based on separate anisotropic diffusion of chromatic and achromatic channels”, IEE Proceedings Vision, Image and Signal Processing, vol. 148, no. 3, pp. 141 –150, June 2001.

[B101] Song Chun Zhu and A. Yuille, “Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 9, pp. 884-900, Sept. 1996.

[B102] Jaesang Park and J.M. Keller, “Snakes on the watershed”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 10, pp. 1201-1205, Oct. 2001.

[B103] Jianping Fan and D.K.Y. Yau and A.K. Elmagarmid and W.G. Aref, “Automatic image segmentation by integrating color-edge extraction and seeded region growing”, IEEE Transactions on Image Processing, vol. 10, no. 10, pp. 1454-1466, Oct. 2001.

12.3 References C

[C1] C. S. McCamy, H. Marcus, and J. G. Davidson. A colour-rendition chart. Journal of Applied Photographic Engineering, 2(3), Summer 1976.

[C2] Makoto Miyahara. Mathematical transform of (r,g,b) colour data to munsell (h,s,v) colour data. In SPIE Visual Communications and Image Processing, volume 1001, 1988.

[C3] Jia Wang, Wen-Jann Yang, and Raj Acharya. Colour clustering techniques for colour-content-based image retrieval from image databases. In Proc. IEEE Conf. on Multimedia Computing and Systems, 1997.

[C4] Michael Swain and Dana Ballard. Colour indexing. International Journal of Computer Vision, 7(1), 1991.

[C5] Mikihiro Ioka. A method of defining the similarity of images on the basis of colour information. Technical Report RT0030, IBM Research, Tokyo Research Laboratory, November 1989.

[C6] W. Niblack, R. Barber, and et al. The QBIC project: Querying images by content using colour, texture and shape. In Proc. SPIE Storage and Retrieval for Image and Video Databases, Feb 1994.

[C7] Markus Stricker and Markus Orengo. Similarity of colour images. In Proc. SPIE Storage and Retrieval for Image and Video Databases, 1995.

[C8] John R. Smith and Shih-Fu Chang. Single colour extraction and image query. In Proc. IEEE Int. Conf. on Image Proc., 1995.

[C9] John R. Smith and Shih-Fu Chang. Tools and techniques for colour image retrieval. In IS & T/SPIE proceedings Vol.2670, Storage & Retrieval for Image and Video Databases IV, 1995.

[C10] John R. Smith and Shih-Fu Chang. Automated binary texture feature sets for image retrieval. In Proc ICASSP96, Atlanta, GA, 1996.

[C11] Robert M. Haralick, K. Shanmugam, and Its'hak Dinstein. Texture features for image classification. IEEE Trans. on Sys, Man, and Cyb, SMC-3 (6), 1973.

[C12] Calvin C. Gotlieb and Herbert E. Kreyszig. Texture descriptors based on co-occurrence matrices. Computer Vision, Graphics, and Image Processing, 51, 1990.

[C13] Hideyuki Tamura, Shunji Mori, and Takashi Yamawaki. Texture features corresponding to visual perception. IEEE Trans. on Sys, Man, and Cyb, SMC-8 (6), 1978.

[C14] Will Equitz and Wayne Niblack. Retrieving images from a database using texture algorithms from the QBIC system. Technical Report RJ 9805, Computer Science, IBM Research Report, May 1994.

[C15] Thomas S. Huang, Sharad Mehrotra, and Kannan Ramchandran. Multimedia analysis and retrieval system (MARS) project. In Proc of 33rd Annual Clinic on Library Application of Data Processing – Digital Image Access and Retrieval, 1996.

[C16] Michael Ortega, Yong Rui, Kaushik Chakrabarti, Sharad Mehrotra, and Thomas S. Huang. Supporting similarity queries in MARS. In Proc. of ACM Conf. on Multimedia, 1997.

[C17] John R. Smith and Shih-Fu Chang. Transform features for texture classification and discrimination in large image databases. In Proc. IEEE Int. Conf. on Image Proc., 1994.

[C18] Tianhorng Chang and C.-C. Jay Kuo. Texture analysis and classification with tree-structured wavelet transform. IEEE Trans. Image Proc., 2(4): 429--441, October 1993.

[C19] Andrew Laine and Jian Fan. Texture classification by wavelet packet signatures. IEEE Trans. Patt. Recog. and Mach. Intell., 15(11):1186--1191, 1993.

[C20] M. H. Gross, R. Koch, L. Lippert, and A. Dreger. Multiscale image texture analysis in wavelet spaces. In Proc. IEEE Int. Conf. on Image Proc., 1994.

[C21] Amlan Kundu and Jia-Lin Chen. Texture classification using qmf bank-based subband decomposition. CVGIP: Graphical Models and Image Processing, 54(5): 369--384, September 1992.

[C22] K. S. Thyagarajan, Tom Nguyen, and Charles Persons. A maximum likelihood approach to texture classification using wavelet transform. In Proc. IEEE Int. Conf. on Image Proc., 1994.

[C23] Joan Weszka, Charles Dyer, and Azriel Rosenfeld. A comparative study of texture measures for terrain classification. IEEE Trans. on Sys, Man, and Cyb, SMC-6 (4), 1976.

[C24] Philippe P. Ohanian and Richard C. Dubes. Performance evaluation for four classes of texture features. Pattern Recognition, 25(8): 819--833, 1992.

[C25] G. R. Cross and A. K. Jain. Markov random field texture models. IEEE Trans. Patt. Recog. and Mach. Intell., 5:25--39, 1983.

[C26] A. P. Pentland. Fractal-based description of natural scenes. IEEE Trans. Patt. Recog. and Mach. Intell., 6(6):661--674, 1984.

[C27] W. Y. Ma and B. S. Manjunath. A comparison of wavelet transform features for texture image annotation. In Proc. IEEE Int. Conf. on Image Proc., 1995.

[C28] Yong Rui, Alfred C. She, and Thomas S. Huang. Modified Fourier descriptors for shape representation -- a practical approach. In Proc. of First International Workshop on Image Databases and Multi-Media Search, 1996.

[C29] C. T. Zahn and R. Z. Roskies. Fourier descriptors for plane closed curves. IEEE Trans. on Computers, 1972.

[C30] E. Persoon and K. S. Fu. Shape discrimination using Fourier descriptors. IEEE Trans. Sys. Man, Cyb., 1977.

[C31] M. K. Hu. Visual pattern recognition by moment invariants, computer methods in image analysis. IRE Transactions on Information Theory, 8, 1962.

[C32] Luren Yang and Fritz Albregtsen. Fast computation of invariant geometric moments: A new method giving correct results. In Proc. IEEE Int. Conf. on Image Proc., 1994.

[C33] Deepak Kapur, Y. N. Lakshman, and Tushar Saxena. Computing invariants using elimination methods. In Proc. IEEE Int. Conf. on Image Proc., 1995.

[C34] David Cooper and Zhibin Lei. On representation and invariant recognition of complex objects based on patches and parts. Springer Lecture Notes in Computer Science series, 3D Object Representation for Computer Vision, pages 139--153, 1995. M. Hebert, J. Ponce, T. Boult, A. Gross, editors.

[C35] Z. Lei, D. Keren, and D. B. Cooper. Computationally fast Bayesian recognition of complex objects based on mutual algebraic invariants. In Proc. IEEE Int. Conf. on Image Proc.

[C36] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, 1996.

[C37] Esther M. Arkin, L. Chew, D. Huttenlocher, K. Kedem, and J. Mitchell. An efficiently computable metric for comparing polygonal shapes. IEEE Trans. Patt. Recog. and Mach. Intell., 13(3), March 1991.

[C38] Gene C.-H. Chuang and C.-C. Jay Kuo. Wavelet descriptor of planar curves: Theory and applications. IEEE Trans. Image Proc., 5(1): 56--70, January 1996.

[C39] H. G. Barrow. Parametric correspondence and chamfer matching: Two new techniques for image matching. In Proc 5th Int. Joint Conf. Artificial Intelligence, 1977.

[C40] Gunilla Borgefors. Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Trans. Patt. Recog. and Mach. Intell., 1988.

[C41] Bingcheng Li and Song De Ma. On the relation between region and contour representation. In Proc. IEEE Int. Conf. on Image Proc., 1995.

[C42] Babu M. Mehtre, M. Kankanhalli, and Wing Foon Lee. Shape measures for content-based image retrieval: A comparison. Information Processing & Management, 33(3), 1997.

[C43] Timothy Wallace and Paul Wintz. An efficient three-dimensional aircraft recognition algorithm using normalized Fourier descriptors. Computer Graphics and Image Processing, 13, 1980.

[C44] Timothy Wallace and Owen Mitchell. Three-dimensional shape analysis using local shape descriptors. IEEE Trans. Patt. Recog. and Mach. Intell., PAMI-3(3), May 1981.

[C45] Gabriel Taubin. Recognition and positioning of rigid objects using algebraic moment invariants. In SPIE Vol. 1570 Geometric Methods in Computer Vision, 1991.

[C46] C. Faloutsos, M. Flickner, W. Niblack, D. Petkovic, W. Equitz, and R. Barber. Efficient and effective querying by image content. Technical report, IBM Research Report, 1993.

[C47] Tat Seng Chua, Kian-Lee Tan, and Beng Chin Ooi. Fast signature-based colour-spatial image retrieval. In Proc. IEEE Conf. on Multimedia Computing and Systems, 1997.

[C48] H Lu, B. Ooi, and K. Tan. Efficient image retrieval by colour contents. In Proc. of the 1994 Int. Conf. on Applications of Databases, 1994.

[C49] L. Cinque, S. Levialdi, and A. Pellicano, Color-Based Image Retrieval Using Spatial-Chromatic Histograms, IEEE Multimedia Systems 99, vol. II, 969-973, 1999.

[C50] Markus Stricker and Alexander Dimai. Colour indexing with weak spatial constraints. In Proc. SPIE Storage and Retrieval for Image and Video Databases, 1996.

[C51] I. Kompatsiaris, E. Triantafillou and M. G. Strintzis, "Region-Based Colour Image Indexing and Retrieval", 2001 International Conference on Image Processing (ICIP2001), Thessaloniki, Greece, October 7-10, 2001.

[C52] M. Chock et al. Database structure and manipulation capabilities of the picture database management system PICDMS, IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(4), 484-492, 1984.

[C53] N. Roussopoulos et al. An efficient pictorial database system for PSQL, IEEE Transactions on Software Engineering, 14(5), 639-650, 1988.

[C54] S K Chang et al “An intelligent image database system” IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(5), 681-688, 1988.

[C55] S K Chang and E Jungert. Pictorial data management based upon the theory of symbolic projections, Journal of Visual Languages and Computing 2, 195-215, 1991.

[C56] S Tirthapura et al. Indexing based on edit-distance matching of shape graphs, in Multimedia Storage and Archiving Systems III (Kuo, C C J et al, eds), Proc SPIE 3527, 25-36, 1998.

[C57] M Mitra, J Huang, S R Kumar. Combining Supervised Learning with Color Correlograms for Content-Based Image Retrieval, Proc. of the Fifth ACM Multimedia Conference, 1997.

[C58] C. E. Jacobs et al. Fast Multiresolution Image Querying, Proceedings of SIGGRAPH 95, Los Angeles, CA (ACM SIGGRAPH Annual Conference Series, 1995), 277-286, 1995.

[C59] S Ravela and R Manmatha. On computing global similarity in images, in Proceedings of IEEE Workshop on Applications of Computer Vision (WACV98), Princeton, NJ , 82-87, 1998.

[C60] N W Campbell et al. Interpreting Image Databases by Region Classification, Pattern Recognition 30(4), 555-563, 1997.

[C61] C S Carson et al. Region-based image querying, in Proceedings of IEEE Workshop on Content-Based Access of Image and Video Libraries, San Juan, Puerto Rico, 42-49, 1997.

[C62] W Y Ma and B S Manjunath. A texture thesaurus for browsing large aerial photographs, Journal of the American Society for Information Science, 49 (7), 633-648, 1998.

[C63] D Androutsas et al. Image retrieval using directional detail histograms, in Storage and Retrieval for Image and Video Databases VI, Proc SPIE 3312, 129-137, 1998.

[C64] S. Adali, K. S. Candan, S-S. Chen, K. Erol, V. S. Subrahmanian, "Advanced Video Information System: Data Structure and Query Processing", Multimedia System Vol. 4, No. 4, Aug. 1996, pp. 172-86.

[C65] C. Decleir, M-S. Hacid, J. Kouloumdjian, "A Database Approach for Modelling and Querying Video data", LTCS-Report 99-03, 1999.

[C66] H. Jiang, A. Elmagarmid, "Spatial and temporal content-based access to hypervideo databases" VLDB Journal, 1998, No. 7, pp. 226-238.

[C67] J. Z. Li, M. T. Ozsu, D. Szafron, "Modeling of Video Spatial Relationships in an Object Database Management System", Proc. of Int. Workshop on Multi-media Database Management Systems, 1996, pp. 124-132.

[C68] G. Ahanger, D. Benson, and T.D.C. Little, ``Video Query Formulation,'' Proc. IS&T/SPIE, Conference on Storage and Retrieval for Image and Video Databases, Vol. 2420, February 1995, pp. 280-291.

[C69] A.D. Bimbo, M. Campanai, and P. Nesi, ``A Three-Dimensional Iconic Environment for Image Database Querying,'' IEEE Trans. on Software Engineering, Vol. 19, No. 10, October 1993, pp. 997-1011.

[C70] S.K. Chang and T. Kunii, ``Pictorial Database Systems,'' IEEE Computer, Ed. S.K. Chang, November 1981, pp. 13-21.

[C71] M. Davis, ``Media Streams: An Iconic Visual Language for Video Annotation," Proc. IEEE Symposium on Visual Languages, Bergen, Norway, 1993, pp. 196-202.

[C72] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, ``Query by Image and Video Content: The QBIC System,'' IEEE Computer, Vol. 28, No. 9, September 1995, pp. 23-32.

[C73] T. Hamano, ``A Similarity Retrieval Method for Image Databases Using Simple Graphics,'' IEEE Workshop on Languages for Automation, Symbiotic and Intelligent Robotics, University of Maryland, August 29-31, 1988, pp. 149-154.

[C74] K. Hirata, and T. Kato, ``Query By Visual Example,'' Proc. 3rd Intl. Conf. on Extending Database Technology, Vienna, Austria, March 1992, pp. 56-71.

[C75] T. Joseph and A.F. Cardenas, ``PICQUERY: A High Level Query Language for Pictorial Database Management,'' IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 630-638.

[C76] T.D.C. Little, G. Ahanger, R.J. Folz, J.F. Gibbon, A. Krishnamurthy, P. Lumba, M. Ramanathan, and D. Venkatesh, ``Selection and Dissemination of Digital Video via the Virtual Video Browser,'' Journal of Multimedia Tools and Applications, Vol. 1 No. 2, June 1995, pp. 149-172.

[C77] J.A. Orenstein, and F.A. Manola, ``PROBE Spatial Data Modeling and Query Processing in an Image Database Application,'' IEEE Trans. on Software Engineering, Vol. 14, No. 5, pp. 611-629, May 1988.

[C78] N. Roussopoulos, C. Faloutsos, and T. Sellis, ``An Efficient Pictorial Database System for PSQL,'' IEEE Trans. on Software Engineering, Vol. 14, May 1988, pp. 639-650.

[C79] L.A. Rowe, J.S. Boreczky, C.A. Eads, ``Indexes for User Access to Large Video Databases,'' Proc. IS&T/SPIE, Storage and Retrieval for Image and Video Databases II, CA, February 1994.

[C80] P. Hill, ‘Review of current content based recognition and retrieval systems’, Technical report 05/1, Virtual DCE.

[C81] M. Flickner et al. “Query by image and video content: The QBIC system”, IEEE Computer 28, pp 23-32, September 1995.

[C82] A. Guttman. ‘R-Trees: A Dynamic Index Structure for Spatial Searching”, Proc. of the 1984 ACM SIGMOD Conf on Management of Data, pp 47-57, June 1984.

[C83] W.Y. Ma and B.S. Manjunath. “Netra: A toolbox for navigating large image databases”, Proc. of IEEE Intl. Conf. on Image Processing, vol. 1, pp 568-571, 1997.

[C84] Y. Deng and B.S. Manjunath. “NeTra-V: toward an object-based video representation”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 5, pp 616-627, September 1998.

[C85] W.Y. Ma and B.S. Manjunath. “Edge flow: a framework of boundary detection and image segmentation”, Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp 744-749, 1997.

[C86] V. E. Ogle and M. Stonebraker. “Chabot: Retrieval from a Relational Database of Images”, IEEE Computer, Vol. 28, No. 9, pp 164-190 September 1995.

[C87] C.E. Jacobs, A. Finkelstein and D.H. Salesin. “Fast Multiresolution Image Querying”, Proc. of SIGGRAPH 95, in Computer Graphics Proceedings, Annual Conference Series, pp 277-286, August 1995.

[C88] M. Blume and D.R. Ballard. “Image annotation based on learning vector quantisation and localised Haar wavelet transform features.”, Technical report, Reticular Systems, Inc. 1997.

[C89] J. Ze Wang, G. Wiederhold, O. Firschein, and S.X. Wei. “Applying wavelets in image database retrieval.”, Technical report, Stanford University, 1996.

[C90] J. Ze Wang, G. Wiederhold, O. Firschein, and S.X. Wei. “Wavelet-based image indexing techniques with partial sketch retrieval capability.”, Proc. of the Fourth Forum on Research and Technology Advances in Digital Libraries, pp 130-142, 1997.

[C91] B.Levianaise-Obadia. “Video Database Retrieval: Literature Review”, VCE Technical Report T8/98-02/1, March 1998.

[C92] J.P. Eakins, K. Shields, and J. Boardman. “Artisan - a shape retrieval system based on boundary family indexing.” Proc. SPIE, vol 2670, pp 17-28, 1996.

[C93] J.R. Smith and S-F. Chang. “Tools and techniques for Color Image Retrieval”, Proc. of SPIE, vol. 2670, pp 426-437, 1996.

[C94] J.R. Smith and S.-F. Chang. “An image and video search engine for the world-wide web”, Proc. of SPIE, vol. 3022, pp 85-95, 1997.

[C95] M.J. Swain, C. Frankel, and V. Athitsos. “Webseer: An image search engine for the world wide web.”, Technical report, University of Chicago, 1996.

[C96] S. Sclaroff, L. Taycher, and M. La Cascia. “Image Rover: A content-based image browser for the world wide web”, Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, pp 10-18, June 1997.

[C97] A. Pentland, R.W. Picard, and S. Sclaroff. “Photobook: Content-based manipulation of image databases.” Intern. J. Comput. Vision, 18(3), pp 233-254, 1996.

[C98] G. Iyengar and A.B. Lippman. “Videobook: an experiment in characterisation of Video”, ICIP, vol. 3, pp 855-858, 1996.

[C99] T.P. Minka and R. Picard. “Interactive learning with a society of models.”, CVPR, pp 447-452, 1996.

[C100] H. Haase. “FRAMER: A portable persistent representation library”, Proc. of the MAI Workshop on AI in Systems and Support, Am. Assoc. for AI, 1993.

[C101] S.F. Chang, W. Chen, H.J. Meng, H. Sundaram, and D. Zhong. “VideoQ: An automated content based video search system using visual cues.”, ACM Multimedia, 1997.

[C102] A. Hampapur, et.al. “Virage video engine”, Proc. of SPIE, vol. 3022, pp 188-200, 1997.

[C103] J.R. Bach et. al. “Virage image search engine: an open framework for image management” Proc. of SPIE, vol. 2670, pp 76-87, 1996.

[C104] Scott Craver et al. “Multi-Linearization Data Structure for Image Browsing”, Proc. SPIE vol. 3656, pp 155-166, 1999.

[C105] O.T. Brewer, Jr. “A user interface framework for image searching”, Proc. SPIE, vol. 3656, pp 573-580, 1999.

[C106] S. Santini and R. Jain. “Interfaces for emergent semantics in multimedia databases”, Proc. SPIE, vol. 3656, pp 167-175, 1999.

[C107] B. Xuesheng, X. Guangyou and S. Yuanchun. “Similarity sequence and its application in shot organization”, Proc. SPIE, vol. 3656, pp 208-217, 1999.

[C108] M.G. Christel. “Multimedia Abstractions for a Digital Video Library”, Proc. of the ACM Digital Libraries ‘97 Conference., July 1997.

[C109] M. La Cascia and E. Ardizzone. “Jacob: Just a content-based query system for video databases.”, ICASSP’96, 1996.

[C110] J-Y Chen, C.A. Bouman, and John Dalton. “Similarity pyramids for browsing and organization of large image databases”, Proc. SPIE, vol. 3656, pp 144-154, 1999.

12.4 References D

[D1] Amadasun M., King R., Textural features corresponding to textural properties, IEEE Transaction on System, Man and Cybernetics, Vol. SMC-19(5), pp. 1264-1274, 1989.

[D2] Barolo B., Gagliardi I., Schettini R., An effective strategy for querying image databases by color distribution, Computer and the History of Art, Special issue on Electronic Imaging and the Visual Arts, Vol 7(1), pp. 3-14, 1997.

[D3] Binaghi E., Della Ventura A., Rampini A., Schettini R. A fuzzy reasoning approach to similarity evaluation in image analysis International Journal of Intelligent Systems, Vol. 8(7), pp. 749-769, 1993.

[D4] Binaghi E., Gagliardi I., Schettini R. Image retrieval using fuzzy evaluation of color similarity International Journal of Pattern Recognition and Artificial Intelligence, Vol 8(4), pp. 945-968, 1994.

[D5] Caelli T., Reye D. On the classification of image regions by colour, texture and shape Pattern Recognition, Vol. 26, pp. 461-470, 1993.

[D6] Dimai A., Stricker M. Spectral covariance and fuzzy regions for image indexing Technical report BIWI-TR-173, Swiss Federal Institute of Technologies, ETH, Zurich, 1996.

[D7] DOCMIX State-of-the-Art and Market Requirements in Europe Electronics image banks, CEE Final report, March 1988, EUR 11736, DG XIII, Jean Monnet Building, L-2920 Luxembourg

[D8] Du Buf J.M.H., Kardan M., Spann M. Texture feature performance for image segmentation Pattern Recognition, Vol. 23, pp. 291-309, 1990.

[D9] Equitz W., Niblack W. Retrieving images from a database: using texture algorithms from the QBIC system IBM Research Division, Research Report 9805, 1994.

[D10] Faloutsos C., Barber R., Flickner M., Hafner J., Niblack W., Petkovic D. Efficient and effective querying by image content, Journal of Intelligent Information Systems, Vol. 3, pp. 231-262, 1994.

[D11] Finlayson G.D., Chatterjee S.S., Funt B.V. Color angular indexing, Proc. of the Fourth European Conference on Computer Vision (Vol. II), pp. 16-27, European Vision Society, 1996.

[D12] Francos J.M., Meiri A.Z., Porat B. A unified texture model based on a 2-D Wold-like decomposition, IEEE Trans. on Signal Processing, pp. 2665-2678, 1993.

[D13] Francos J.M., Meiri A.Z., Porat B. Modeling of the texture structural components using 2-d deterministic random field Visual Communication and Image Processing, Vol. SPIE 1666, pp. 554-565, 1991.

[D14] Gagliardi I., Schettini R., A method for the automatic indexing of color images for effective image retrieval, The New Review of Hypermedia and Multimedia, 1997 (submitted).

[D15] Gershon R. Aspects of perception and computation in color vision Computer Vision, Graphics, and Image Processing, Vol. 32, pp. 224-277, 1985.

[D16] Gimel’Farb G.L., Jain A.K. On retrieving textured images from an image database Pattern Recognition, Vol. 29, pp. 1461-1483, 1996.

[D17] Hafner J., Sawhney H.S., Equitz W., Flickner M., Niblack W. Efficient color histogram indexing for quadratic form distance functions, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. PAMI 17, pp. 729-736, 1995.

[D18] Healey G., Wang L. The illumination-invariant recognition of texture in color images J. of Optical Society of America A, Vol. 12, pp. 1877-1883, 1995.

[D19] Kondepudy R., Healey G. Use of invariants for recognition of three-dimensional color textures, J. of Optical Society of America A, Vol. 11, pp. 3037-3049, 1994.

[D20] Liu F., Picard R.W. Periodicity, directionality and randomness: Wold features for perceptual pattern recognition MIT Vision and Modeling Lab., Tech. Report #320, 1994.

[D21] Ma W.Y., Manjunath B.S. Texture features and learning similarity, Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, San Francisco, CA, 1996.

[D22] Manjunath B.S., Ma W.Y., Texture features for browsing and retrieval of image data, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 18, pp. 837-842, 1996.

[D23] McGill M.J., Salton G. Introduction to modern Information Retrieval, McGraw-Hill, 1983.

[D24] Mehtre B.M., Kankanhalli M.S., Desai Narasimhalu A., Man G.C. Color matching for image retrieval Pattern Recognition Letters, Vol. 16, pp. 325-331, 1995.

[D25] Pentland A., Picard R.W. Photobook: tools for content-based manipulation of image databases, SPIE Storage and Retrieval of Image and Video Databases II, pp. 34-47, 1994.

[D26] Picard R.W., Minka T.P. Vision texture for annotation Multimedia Systems, No. 3, pp. 3-14, 1995.

[D27] Rao A.R., Lohse G.L. Identifying High Level Features of Texture Perception CVGIP: Graphical Models and Image Processing, Vol. 55(3), pp. 218-233, 1993.

[D28] Rao A.R., Lohse G.L. Towards a texture naming system: identifying relevant dimensions of texture IBM Research Division, Research Report 19140, 1993.

[D29] Rosenfeld A., Wang C-Y, Wu A.Y. Multispectral texture IEEE Trans. on Systems, Man, Cybernetics, Vol. 12, pp. 79-84, 1982.

[D30] Schettini R., Pessina A. Unsupervised classification of complex color texture images Proc. IV IS&T and SID's Color Imaging Conference, Scottsdale, Arizona, pp. 163-166, 1996.

[D31] Smith J. R. and Chang S.-F. Automated Binary Texture Feature Sets for Image Retrieval, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, 1996.

[D32] Smith J. R. and Chang S.-F. VisualSEEk: A fully automated content-based image query system, Proc. Fourth International Multimedia Conference, Multimedia 96, Boston (Ma), pp. 87-98, 1996

[D33] Song K.Y., Kittler J., Petrou M. Defect detection in random colour textures, Image and Vision Computing, Vol. 14, pp. 667-683, 1996.

[D34] Stricker M.A., Bounds for the discrimination power of color indexing techniques Proc. SPIE, Vol. 2185, pp. 15-24, 1993.

[D35] Stricker M.A., Orengo M. Similarity of color images Storage and Retrieval for image databases III, Proc. SPIE 2420, pp. 381-392, 1995.

[D36] Sung K-K.- A vector signal processing approach to color MIT Technical Report AIM 1349, 1992.

[D37] Swain M.J. Color Indexing Technical Report n. 360, University of Rochester, Rochester, New York, 1990.

[D38] Tamura H., Mori S., Yamawaki T. Textural Features Corresponding to Visual Perception, IEEE Transactions on Systems, Man and Cybernetics, Vol. SMC-8(6), pp. 460-473, 1978.

[D39] Tan T.S.C., Kittler J. Colour texture classification using features from colour histogram Proc. 8th Scandinavian Conf. on Image Analysis, SCIA '93, pp. 807-811, 1993.

[D40] Tan T.S.C., Kittler J. On colour texture representation and classification, Proc. 2nd Int. Conference on Image Processing, Singapore, pp. 390-395, 1992.

[D41] Tuceryan M., Moment-based texture segmentation Pattern Recognition Letters, Vol. 15, pp. 659-668, 1994.

[D42] Tuceryan M., Jain A.K. Texture analysis Handbook of Pattern Recognition and Computer Vision (Eds. C.H. Chen, L.F. Pau, P.S.P. Wang), pp. 236-276, 1994.

[D43] Wyszecki G., Stiles W.S. Color science: concepts and methods, quantitative data and formulae Wiley, New York, 1982.

12.5 References E

[E1] J.S. Boreczky and L.D. Wilcox, A Hidden Markov Model framework for video segmentation using audio and image features, in Proceedings of the ICASSP, Vol.6, 1998, pp 3741-44.

[E2] M. Casey, MPEG-7 sound recognition tools, in IEEE Trans. on Circuits and Systems for Video Technology, Vol. 11, No.6, June 2001.

[E3] N. Dimitrova, L. Agnihotri, and G. Wei, Video classification using object tracking, International Journal of Image and Graphics, Special issue on content-based image and video retrieval. 2001.

[E4] N. Dimitrova, H.-J. Zhang, B. Shahraray, I. Sezan, T. Huang, A. Zakhor, Applications of video-content analysis and retrieval. IEEE Multimedia 2002.

[E5] A. Divakaran, Video summarization and indexing using combinations of the MPEG-7 motion activity descriptor and other audio-visual descriptors, in Proc. of IWDC02, Capri, Italy, September.

[E6] A. Hanjalic and L.-Q. Xu, User-oriented affective video content analysis, Proceedings IEEE Workshop on Content-based Access of Image and Video Libraries in conjunction with IEEE CVPR-2001, Kauai, Hawaii USA, December, 2001.

[E7] A. Hauptmann and R. Jin, Video information retrieval: Lessons learned with the Informedia Digital Video Library, in Proc. of IWDC02, Capri, Italy, September.

[E8] Z. Liu, J. Huang, and Y. Wang, Classification of TV programs based on audio information using hidden Markov model, in IEEE Workshop Multimedia Signal Processing (MMSP-98), Dec 1998.

[E9] J. Huang, Z. Liu, Y. Wang, Y. Chen, and E. K. Wong, Integration of multimodal features for video classification based on HMM, in IEEE Workshop Multimedia Signal Processing (MMSP-99), Sept 1999, pp. 53-58.

[E10]Z. Liu, Y. Wang, and T. Chen, Audio feature extraction and analysis for scene segmentation and classification, in Journal of VLSI Signal Processing, pp. 61-79, Oct 1998.

[E11]S. Pfeiffer, S. Fischer, and W. Effelsberg, Automatic audio content analysis, Proceedings of 4th ACM Multimedia Conference, 18-22 Nov. 1996, pp 21-30.

[E12]S. Quackenbush, A. Lindsay, Overview of MPEG-7 audio, in IEEE Trans. on Circuits and Systems for Video Technology, Vol.11, No.6, June 2001.

[E13]Z. Rasheed and M. Shah, Movie genre classification by exploiting audio-visual features of previews, in Proceedings of ICPR'2002.

[E14]M.J. Roach and J.S.D. Mason, Classification of video genre using audio, Proc. of Eurospeech, 2001.

[E15]M.J. Roach, J.S.D. Mason, and L.-Q. Xu, Video genre verification using both acoustic and visual modes, to appear in Proc. of 5th IEEE Intl Workshop on Multimedia Signal Processing, US Virgin Islands, December 9-11, 2002.

[E16]Y. Rui, A. Gupta, and A. Acero, Automatically extracting highlights for TV baseball programs, in Proc. ACM Multimedia 2000, New York, pp. 105-115.

[E17]C. Saraceno, R. Leonardi, Identification of story units in audio-visual sequences by joint audio and video processing, Proceedings of ICIP'98, Oct. 1998, Vol.1, pp 363-7.

[E18]J.R. Smith, C.-Y. Lin, M. Naphade, P. Natsev, and B. Tseng, Learning concepts from video using multi-modal features. Proc. IWDC02, Capri, Italy, September.

[E19]M. A. Smith and T. Kanade, Video skimming and characterisation through the combination of image and language understanding, Proceedings 1998 IEEE Int’l Workshop on Content-Based Access of Image and Video Database, pp 61-70.

[E20]C.G.M. Snoek and M. Worring, Multimodal video indexing - a review of the state-of-the-art, in Proc. of ICME'2002.

[E21]H. Sundaram, S.-F. Chang, Determining computable scene in films and their structures using audio-visual memory models, in Proc. of ACM Multimedia’2000.

[E22]H. Sundaram and S.-F. Chang, Audio scene segmentation using multiple models, features and time scales, in Proc. ICASSP 2000, Istanbul, Turkey, June 5-9, 2000.

[E23]H. Sundaram and S.-F. Chang, Video scene segmentation using audio and video features, in Proc. ICME 2000, New York, 2000.

[E24]G. Tzanetakis, G. Essl, and P. Cook, Automatic music genre classification of audio signals, Proc. Int’l Symposium on Music Information Retrieval, 2001.

[E25]H. Wang, A. Divakaran, A. Vetro, S.-F. Chang, and H. Sun, Survey of compressed-domain features used in audio-visual indexing and analysis. Manuscript Submitted.

[E26]Y. Wang, Z. Liu and J. Huang, Multimedia content analysis using both audio and visual clues, in IEEE Signal Processing Magazine, Vol. 17, No. 6, pp. 12-36, Nov. 2000.

[E27]E. Wold, T. Blum, D. Keislar, and J. Wheaton, Content-based classification, search, and retrieval of audio, IEEE Multimedia Magazine, vol.3, pp 27-36, 1996.

[E28]T. Zhang and C.-C. Kuo, Hierarchical classification of audio data for archiving and retrieving, in Proc. of ICASSP’97, Vol.6, pp 3001-4.

[E29]T. Zhang and C.-C. Kuo, Heuristic approach for generic audio segmentation and annotation, in Proc. of ACM Multimedia’99, pp 67-76.

12.6 References F

[F1] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active contour models,” International Journal of Computer Vision, vol. 1, pp. 321–332, 1988.

[F2] L. Cohen, “On active contour models and balloons,” Computer Vision, Graphics and Image Processing: Image Understanding, vol. 53, pp. 211–218, March 1991.

[F3] V. Caselles, R. Kimmel, and G. Sapiro, “Geodesic active contours,” International Journal of Computer Vision, vol. 22, no. 1, pp. 61–79, 1997.

[F4] L. Cohen, E. Bardinet, and N. Ayache, “Surface reconstruction using active contour models,” in SPIE Conference on Geometric Methods in Computer Vision, San Diego, CA, 1993.

[F5] R. Ronfard, “Region-based strategies for active contour models,” International Journal of Computer Vision, vol. 13, no. 2, pp. 229–251, 1994.

[F6] A. Chakraborty, L. Staib, and J. Duncan, “Deformable boundary finding in medical images by integrating gradient and region information,” IEEE Transactions on Medical Imaging, vol. 15, pp. 859–870, 1996.

[F7] S. Zhu, T.S. Lee, and A. Yuille, “Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation,” in International Conference on Computer Vision, 1995, pp. 416–423.

[F8] N. Paragios and R. Deriche, “Geodesic active regions for motion estimation and tracking,” in International Conference on Computer Vision, Corfu, Greece, 1999.

[F9] N. Paragios and R. Deriche, “Geodesic active regions and level set methods for supervised texture segmentation,” International Journal of Computer Vision, vol. 46, no. 3, pp. 223, 2002.

[F10] A. Yezzi, A. Tsai, and A. Willsky, “A statistical approach to snakes for bimodal and trimodal imagery,” in IEEE International Conference on Computer Vision (ICCV), 1999.

[F11] C. Chesnaud, P. Réfrégier, and V. Boulet, “Statistical region snake-based segmentation adapted to different physical noise models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, pp. 1145–1156, Nov. 1999.

[F12] S. Zhu and A. Yuille, “Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, pp. 884–900, September 1996.

[F13] C. Samson, L. Blanc-Féraud, G. Aubert, and J. Zerubia, “A level set model for image classification,” International Journal of Computer Vision, vol. 40, no. 3, pp. 187–197, 2000.

[F14] T. Chan and L. Vese, “Active contours without edges,” IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 266–277, 2001.

[F15] E. Debreuve, M. Barlaud, G. Aubert, and J. Darcourt, “Space time segmentation using level set active contours applied to myocardial gated SPECT,” IEEE Transactions on Medical Imaging, vol. 20, no. 7, pp. 643–659, July 2001.

[F16] O. Amadieu, E. Debreuve, M. Barlaud, and G. Aubert, “Inward and outward curve evolution using level set method,” in International Conference on Image Processing, Kobe, Japan, 1999.

[F17] J. Sokolowski and J.-P. Zolésio, Introduction to Shape Optimization: Shape Sensitivity Analysis, vol. 16 of Springer Ser. Comput. Math., Springer-Verlag, Berlin, 1992.

[F18] M.C. Delfour and J.-P. Zolésio, Shapes and Geometries, Advances in Design and Control, SIAM, 2001.

[F19] S. Jehan-Besson, M. Barlaud, and G. Aubert, “DREAMS: Deformable regions driven by an Eulerian accurate minimization method for image and video segmentation,” International Journal of Computer Vision, vol. 53, no. 1, pp. 45–70, 2003.

[F20] G. Aubert, M. Barlaud, O. Faugeras, and S. Jehan-Besson, “Image segmentation using active contours: Calculus of variations or shape gradients?,” SIAM Applied Mathematics, to appear, 2003.

[F21] M. Gastaud, M. Barlaud, and G. Aubert, “Tracking video objects using active contours,” in WMVC, Orlando, FL, 2002, pp. 90–95.

[F22] Y. Chen, H. D. Tagare, S. Thiruvenkadam, F. Huang, D. Wilson, K. S. Gopinath, R. W. Briggs, and E. A. Geiser, “Using prior shapes in geometric active contours in a variational framework,” International Journal of Computer Vision, vol. 50, no. 3, pp. 315–328, December 2002.

[F23] D. Cremers, F. Tischhäuser, J. Weickert, and C. Schnörr, “Diffusion snakes: Introducing statistical shape knowledge into the Mumford-Shah functional,” International Journal of Computer Vision, vol. 50, no. 3, pp. 295–313, December 2002.

[F24] P. Charbonnier and O. Cuisenaire, “Une étude des contours actifs : modèles classique, géométrique et géodésique,” Tech. Rep. 163, Laboratoire de télécommunications et télédétection, Université catholique de Louvain, Louvain-la-Neuve, Belgium, 1996.

[F25] S. Osher and J. A. Sethian, “Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations,” J. Comput. Phys., vol. 79, pp. 12–49, 1988.

[F26] G. Barles, “Remarks on a flame propagation model,” Tech. Rep. 464, Projet Sinus, INRIA Sophia Antipolis, Sophia Antipolis, France, 1985.

[F27] J. Gomes and O.D. Faugeras, “Reconciling distance functions and level sets,” Journal of Visual Communication and Image Representation, vol. 11, pp. 209–223, 2000.

[F28] P. Thevenaz, T. Blu, and M. Unser, “Interpolation revisited,” IEEE Transactions on Medical Imaging, vol. 19, July 2000.

[F29] M. Jacob, T. Blu, and M. Unser, “A unifying approach and interface for spline-based snakes,” in SPIE Int. Symp. on Medical Imaging: Image Processing (MI’2001), San Diego, CA, USA, February 19-22, 2001, vol. 4322, pp. 340–347, Part I.

[F30] F. Precioso and M. Barlaud, “B-spline active contours with handling of topology changes for fast video segmentation,” EURASIP Special Issue: Image Analysis for Multimedia Interactive Services, 2002.

[F31] F. Precioso and M. Barlaud, “Regular B-spline active contours for fast video segmentation,” in International Conference on Image Processing, Rochester, NY, 2002.

[F32] M. Unser, A. Aldroubi, and M. Eden, “B-spline signal processing: Part I - Theory,” IEEE Transactions on Signal Processing, vol. 41, no. 2, 1993.

[F33] F. Precioso, M. Barlaud, T. Blu, and M. Unser, “Smoothing B-spline active contour for fast and robust image and video segmentation,” in International Conference on Image Processing, Barcelona, Spain, 2003.

[F34] S. Jehan-Besson, M. Barlaud, and G. Aubert, “Video object segmentation using Eulerian region-based active contours,” in International Conference on Computer Vision, Vancouver, Canada, 2001.

[F35] M. Gastaud and M. Barlaud, “Video segmentation using region based active contours on a group of pictures,” in International Conference on Image Processing, 2002.

[F36] S. Soatto and A. J. Yezzi, “Deformotion: Deforming motion, shape average and the joint registration and segmentation of images,” in European Conference on Computer Vision, 2002.

[F37] M. Gastaud, M. Barlaud, and G. Aubert, “Tracking video objects using active contours and geometric priors,” in 4th European Workshop on Image Analysis for Multimedia Interactive Services, pp. 170–175, London, UK, 2003.

[F38] S. Jehan-Besson, M. Barlaud, G. Aubert, and O. Faugeras, “Shape gradients for histogram segmentation using active contours,” in International Conference on Computer Vision (ICCV’03), Nice, France, 2003.

[F39] S. Jehan-Besson, M. Barlaud, and G. Aubert, “DREAM²S: Deformable Regions driven by an Eulerian Accurate Minimization Method for image and video Segmentation,” in European Conference on Computer Vision, Copenhagen, Denmark, May 2002.

[F40] E. Debreuve, M. Barlaud, G. Aubert, and J. Darcourt, “Space time segmentation using level set active contours applied to myocardial gated SPECT,” IEEE Transactions on Medical Imaging, vol. 20, no. 7, pp. 643–659, July 2001.

[F41] F. Precioso and M. Barlaud, “B-spline Active Contours for Fast Video Segmentation,” in 3rd European Workshop on Image Analysis for Multimedia Interactive Services, Tampere, Finland, May 2001.

12.7 References G

[G1] wwwqbic.almaden.ibm.com/

[G2] www.hermitagemuseum.org/fcgi-bin/db2www/qbicSearch.mac/qbic?selLang=English

[G3] www.cobion.com

[G4] www.dino-online.de

[G5] www.abacho.de

[G6] www.freenet.de

[G7] www.virage.com

[G8] www.convera.com

[G9] www.morphosoft.com

[G10] www.evisionglobal.com

[G11] www.ltutech.com

[G12] www.ltutech.com/Clients.htm

[G13] www.lanternamagica.com

[G14] www.tecmath.de

[G15] www.pictron.com

[G16] www.aliope.com

12.8 References H

[H1] S.F. Chang, The holy grail of content-based media analysis, IEEE Multimedia, Vol. 9, pp. 6-10, Apr.-June 2002.

[H2] Di Zhong and Shih-Fu Chang, Structure Analysis of sports Video Using Domain Models, Proc. ICME'2001, pp. 920-923, Aug. 2001, Tokyo, Japan.

[H3] T. Zhang and C.-C. Jay Kuo, Audio content analysis for online audiovisual data segmentation and classification, IEEE Trans. on Speech and Audio Processing, Vol. 9, pp. 441-457, 2001.

[H4] Y. Wang, Z. Liu and J.C. Huang, Multimedia Content Analysis Using Audio and Visual Information, IEEE Signal Processing Magazine, Vol. 17, pp. 12-36, 2000.

[H5] C.G.M. Snoek and M. Worring, Multimodal video indexing: a review of the state-of-the-art, ISIS Technical Report Series, Vol. 2001-20, Dec. 2001.

[H6] Y. Gong, L.T. Sin, C.H. Chuan, H. Zhang and M. Sakauchi, Automatic parsing of TV soccer programs, Proc. ICMCS'95, May 1995, Washington DC, USA.

[H7] D. Yow, B.L. Yeo, M. Yeung and B. Liu, Analysis and presentation of soccer highlights from digital video, Proc. ACCV'95, Dec. 1995, Singapore.

[H8] P. Xu, L. Xie, S-F Chang, A. Divakaran, A. Vetro and H. Sun, Algorithms and System for Segmentation and Structure Analysis in Soccer Video, Proc. ICME'2001, pp. 928-931, Aug. 2001, Tokyo, Japan.

[H9] L. Xie, S.F. Chang, A. Divakaran and H. Sun, Structure Analysis Of Soccer Video With Hidden Markov Models, Proc. ICASSP'2002, May 2002, Orlando, FL, USA.

[H10] A. Bonzanini, R. Leonardi and P. Migliorati, Semantic Video Indexing Using MPEG Motion Vectors, Proc. EUSIPCO'2000, pp. 147-150, Sept. 2000, Tampere, Finland.

[H11] A. Bonzanini, R. Leonardi and P. Migliorati, Event Recognition in Sport Programs Using Low-Level Motion Indices, Proc. ICME'2001, pp. 920-923, Aug. 2001, Tokyo, Japan.

[H12] R. Leonardi, P. Migliorati and M. Prandini, Modeling of Visual Features by Markov Chains for Sport Content Characterization, Proc. EUSIPCO'2002, Sept. 2002, Toulouse, France.

[H13] R. Leonardi, P. Migliorati, Semantic indexing of multimedia documents, IEEE Multimedia, Vol. 9, pp. 44-51, Apr.-June 2002.

[H14] R. Leonardi, P. Migliorati and M. Prandini, A Markov Chain Model for Semantic Indexing of Sport Program Sequences, Proc. WIAMIS'03, Apr. 2003, London, UK.

[H15] V. Tovinkere and R. J. Qian, Detecting Semantic Events in Soccer Games: Toward a Complete Solution, Proc. ICME'2001, pp. 1040-1043, Aug. 2001, Tokyo, Japan.

[H16] A. Ekin and M. Tekalp, Automatic Soccer Video Analysis and Summarization, Proc. IS&T/SPIE'03, Jan. 2003, CA, USA.

[H17] T. Kawashima, K. Takeyama, T. Iijima and Y. Aoki, Indexing of baseball telecast for content based video retrieval, Proc. ICIP'98, pp. 871-874, Oct. 1998, Chicago, IL., USA.

[H18] Y. Rui, A. Gupta and A. Acero, Automatically extracting highlights for TV Baseball programs, Proc. ACM Multimedia 2000, pp. 105-115, Los Angeles, CA, USA.

[H19] P. Chang, M. Han and Y. Gong, Extract Highlights from Baseball Game Video with Hidden Markov Models, Proc. ICIP'2002, pp. 609-612, Sept. 2002, Rochester, NY.

[H20] M. Petkovic, V. Mihajlovic, W. Jonker and S. Djordjevic-Kajan, Multi-modal Extraction of Highlights from TV Formula 1 Programs, Proc. ICME'2002, Aug. 2002, Lausanne, Switzerland.

[H21] V. Mihajlovic and M. Petkovic, Automatic Annotation of Formula 1 Races for Content-based Video Retrieval, TR-CTIT-01-41, Dec. 2001.

[H22] M. Petkovic, W. Jonker and Z. Zivkovic, Recognizing strokes in tennis videos using hidden Markov models, Proc. Intl. Conf. on Visualization, Imaging and Image Processing, Marbella, Spain, 2001.

[H23] W. Zhou, A. Vellaikal and C.-C Jay Kuo, Rule based video classification system for basketball video indexing, Proc. ACM Multimedia 2000, Los Angeles, CA, USA.

[H24] D.D. Saur, Y.P. Tan, S.R. Kulkarni and P.J. Ramadge, Automated Analysis and annotation of basketball video, SPIE Vol. 3022, Sept. 1997.

[H25] G. Sudhir, J.C.M. Lee and A.K. Jain, Automatic Classification of Tennis Video for High-Level Content-Based Retrieval, IEEE Multimedia, 1997.

[H26] S. Lefevre, B. Maillard and N. Vincent, 3 classes segmentation for analysis of football audio sequences, Proc. ICDSP'2002, July 2002, Santorini, Greece.

[H27] M. Bertini, C. Colombo and A. Del Bimbo, Automatic Caption Localization in Videos Using Salient Points, Proc. ICME'2001, pp. 69-72, Aug. 2001, Tokyo, Japan.

[H28] H. Pan, B. Li and M.I. Sezan, Automatic detection of replay segments in broadcast sports programs by detection of logos in scene transition, Proc. ICASSP'2002, May 2002, Orlando, FL, USA.

[H29] H. Pan, P. van Beek and M.I. Sezan, Detection of Slow-Motion Replay Segments in Sports Video for Highlights Generation, Proc. ICASSP'2001, May 2001, Salt Lake City, USA.

[H30] Mei Han, Wei Hua, Wei Xu and Yihong Gong, An Integrated Baseball Digest System Using Maximum Entropy Method, Proc. ACM Multimedia 2002, Dec. 2002, Juan Les Pins, France.

[H31] Martin L. Puterman, Markov Decision Processes, Wiley, New York, 1994.

End of Report
