Innovative e-environments for Research on Cities and the Media

Deliverable D7.1 Report on State of the Art for moving image analysis

Deliverable lead: FHG

Deliverable due date: 31 July 2016

Actual submission date: 05 August 2016

Version: 1.3

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under

grant agreement No 693559.


Title Report on State of the Art for moving image analysis

Creator Christian Weigel

Description This deliverable provides a review of the state of the art on methods for automatic moving image analysis. A selection of these methods and their further development within the project will comprise the foundation of the technology framework of I-Media-Cities. The document covers the four main topics of “Object Detection and Classification”, “Temporal Video Segmentation”, “Video Motion Estimation” and “Video Quality Estimation”. We discuss what the methods can do for I-Media-Cities and how their results can contribute to the project.

Publisher I-Media-Cities Consortium

Contributors Christian Weigel, Alexander Loos, Ronny Paduschek

Creation date 5 August 2016

Language en-GB

Rights © 2016 I-Media-Cities Consortium

Audience public

Review status Draft | WP leader accepted | Technical Manager accepted | Coordinator accepted

Action requested to be revised by Partners | for approval by the WP leader | for approval by the Technical Committee | for approval by the Project Coordinator

Requested deadline


Contents

1 Executive Summary
2 Object Detection and Classification
  2.1 Executive Summary
  2.2 State-of-the-art techniques for object detection and recognition
3 Temporal Video Segmentation
  3.1 Executive Summary
  3.2 Methods of Temporal Video Segmentation
4 Video Motion Estimation
  4.1 Executive Summary
  4.2 Local vs. Global Motion
  4.3 Methods for Global Motion Estimation
5 Video Quality Estimation
  5.1 Executive Summary
  5.2 Quality control approaches
    5.2.1 Production related video errors
    5.2.2 Analogue errors
    5.2.3 Transmission errors
    5.2.4 Video compression artefacts
  5.3 I-Media-Cities related tasks
6 Conclusion


1 Executive Summary

This deliverable provides a review of the state of the art on methods for automatic moving image analysis. A selection of these methods and their further development within the project will comprise the foundation of the technology framework of I-Media-Cities. The document covers the four main topics of “Object Detection and Classification” (section 2), “Temporal Video Segmentation” (section 3), “Video Motion Estimation” (section 4) and “Video Quality Estimation” (section 5). The choice of these topics originates from discussions that took place during the preparation of the project as well as the very early project meetings. To a large extent, they cover the topics listed in Task 7.1 of work package 7. The topic of “video identification / fingerprinting” is not covered; instead, we added the section on video motion estimation, which also includes visual concept classification with respect to video events. Early discussions within the project about the content we are going to use led to the conclusion that this shift is useful and better represents the needs within I-Media-Cities. However, the requirements process of the project has only just started, and final decisions on which technology will be developed and incorporated into the IMC framework will be made based on this process. Based on the existing background and the partners’ expertise, a shift of focus on these subjects would still be possible. At this point we want to emphasise that this document is only an overview of methods that could be applied to the I-Media-Cities technology framework. Obviously, only a small portion of these methods can be implemented or practically evaluated for I-Media-Cities. It is rather a first narrowing down of what would be useful and possible in terms of automatic moving image analysis.

All the algorithms mentioned in the following sections share one basic aim: to (semi-)automatically enrich moving image content by annotating it at different semantic levels. While some algorithms might only produce simple data, or even data that is not directly usable by humans due to its sheer amount or complexity, others provide rich metadata on a semantically high level. However, we always made the selection of the topics below with the project objective in mind that the data produced is usable directly, after an additional processing step, or through smart visualisation or additional annotation by humans.

The purpose of this deliverable is twofold. Firstly, due to the broad nature of I-Media-Cities (non-technical experts in archiving and historical research meet technology experts and developers), it is intended to give an understandable overview of what the technology in the fields mentioned above can contribute to the project and also what it cannot. Each section contains a rather non-technical executive summary or section that provides this information on a non-technical level. The second purpose is to collect and structure the technical state of the art and the ongoing research work in these fields of computer vision, which are changing rapidly. This serves as a basis for the technical partners of the project to extend, adapt and improve their existing technologies in order to become valuable for I-Media-Cities.

A requirements process is always a chicken-and-egg problem: while the users say “Show me what you can do and I will show you what I need”, the developers reply “Tell me what you need and I will check what I can do”. Therefore we started early with this document. The methods and algorithms therein are a broad selection of what could be done in the respective fields, explained on a technical as well as a non-technical level. As mentioned above, only a prioritised selection of these methods will be used in I-Media-Cities. The final decision will be made based on the requirements process that has just been started.


2 Object Detection and Classification

2.1 Executive Summary

Autonomous detection of general objects in images or short video sequences is a crucial component within this project. In the past 20 years a large and growing amount of literature describing techniques for automatic object detection has been published.

Object detection aims at determining the locations and sizes of general objects such as

• faces

• vehicles (cars, bicycles, trains, ...)

• persons or

• houses

just to name a few. Automatic approaches for object detection often build the basis of subsequent tasks such as tracking, object classification or individual identification. Many approaches that either detect parts of objects (e.g. faces) or complete objects can be found in the literature.

Although object recognition, i.e. automatic classification of detected objects into predefined categories (e.g. car, house, person), is often referred to as a separate field of research, object localization and object classification are in fact very much related. One could argue that for object recognition in a given image one has to detect all objects (a restricted class of objects depending on the dataset), localize them with a bounding box and categorize that bounding box with a label. For object detection, however, one only has to differentiate between two classes (usually the object of interest, e.g. persons, and the background), and estimate bounding boxes around each object of interest. The task of object classification then becomes trivial: one just has to train multiple object detectors, one for each object category, and classification can be done by evaluating which detector fires at a certain position. In modern deep learning-based algorithms, however, multiple object classes are trained at once and thus object detection and classification can be performed by one single unified framework.

Figure 1 shows some examples of object detection (1(a) and 1(b)) by means of the Deformable Part Models (DPM) algorithm presented in [Fel+10] and of object recognition (1(c) and 1(d)) by a recently developed deep-learning framework called YOLO [Red+15].

The following section gives an overview of state-of-the-art systems for automatic object detection.



Figure 1: This figure illustrates the difference between object detection and object recognition. Although sometimes differentiated in the literature, both tasks are strongly connected. Figures (a) and (b) show example detections of persons and bicycles by DPM [Fel+10], while Figures (c) and (d) show object recognition results by YOLO [Red+15] for multiple object classes. Note that for DPM a separate model for each object category needs to be trained, while the YOLO framework can be trained for multiple object classes simultaneously.

2.2 State-of-the-art techniques for object detection and recognition

Automatic object detectors can be classified into six main categories: Template Matching, Rigid Object Detectors, Local Keypoint Detectors, Model-Based Detectors, Deep-Learning-Based Detectors, and Motion-Based Detectors.

Template Matching The simplest detectors use template matching to localize certain objects of a specific category in images or videos. One prototypical instance of the object’s visual appearance is used to detect the object of interest. A template image is typically generated by taking either only one or the mean of many example images that best represent the desired object. The template is then shifted across the source image. By calculating certain similarity metrics such as normalized cross-correlation or intensity difference measures [Cox95], the object location is defined as the area corresponding to the highest similarity score.
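As a minimal sketch of this idea (not part of the project's tooling), the following Python/OpenCV snippet slides a template over a scene image using normalized cross-correlation and reports the best-scoring position; the image file names are assumptions chosen for illustration.

```python
# Minimal template-matching sketch, assuming "scene.png" and "object.png" exist.
import cv2

scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)
h, w = template.shape

# Normalized cross-correlation of the template at every position in the scene.
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)

# The object location is taken as the position with the highest similarity score.
_, max_val, _, max_loc = cv2.minMaxLoc(scores)
x, y = max_loc
print(f"best match at ({x}, {y}) of size {w}x{h}, score {max_val:.2f}")
```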

Although template matching algorithms might be fast and easy to implement, they only achieve adequate results in rather controlled settings and perform poorly in natural real-world environments due to cluttered background and geometrical deformations of the object to be detected. To achieve invariance of the method against object deformations, different scales and rotations must be applied, which significantly increases processing time. Hence, more sophisticated object localization algorithms must be applied to automatically detect generic objects in real-world environments.

Rigid Object Detectors A more advanced group of detectors are called rigid object detectors. As the name suggests, these kinds of detectors are limited to non-deformable objects with similar shape. Visual descriptors and the spatial relationship between them are utilized to localize rigid objects, where features vary among instances but their spatial configuration remains similar. For human face and pedestrian detection, rigid object detectors have been used for over a decade. Although Rowley et al. [RBK98] already achieved promising results with a neural-network based human face detection system in 1998, probably the best known algorithm for real-time object detection was presented by Viola and Jones in 2001 [VJ01]. It uses a boosted cascade of simple Haar-like features which exploit local contrast configurations of an object in order to detect a face. The AdaBoost algorithm [FS99] is utilized for feature selection and learning. Numerous improvements have been proposed over the last decade to achieve wider pose invariance, most of them relying on the same principles as suggested by Viola and Jones. For an overview of face detectors the reader is referred to [HL01; YKA02; RRB12]. One interesting example of a face detection and tracking engine based on the ideas of Viola and Jones is called SHORE and was developed by Ernst and Küblbeck [KE06]. The authors improved the basic approach by Viola and Jones significantly by using multiple consecutive classification stages with increasing complexity and different illumination-invariant feature sets to detect faces in images and video sequences. Each stage is built out of a feature extraction step and a classifier. One out of three illumination-invariant features can be applied: edge orientation features, census features [ZW94], and structure features built out of scaled versions of census features. Real-time capability is achieved by using simple and fast pixel-based features in the first stages and more sophisticated and therefore more complex descriptors in subsequent stages. Each stage consists of look-up tables that are built in an offline training procedure using Real-AdaBoost [SS99]. The first stages can be considered as a fast but inaccurate candidate search, while the remaining stages focus on slower but more accurate classification. For the actual detection, the grey-scaled input image is initially convolved with a 3×3 mean filter kernel to compensate for noise. While the detection model is fixed at a size of 24×24 pixels, the mean-filtered image is downscaled several times to build an image pyramid. To achieve scale invariance, a real-time capable coarse-to-fine search is applied by shifting the classification window across every pyramid level. To ensure real-time performance, only candidate face regions which achieve high confidences in the first stages reach the slower but more accurate detection stages.
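For illustration only (this is not SHORE or the original detector of [VJ01]), OpenCV ships a pre-trained frontal-face Haar cascade in the Viola-Jones style that can be run as a hedged sketch; the input image name and the parameter values below are assumptions, using typical defaults.

```python
# Hedged sketch: Viola-Jones-style face detection with OpenCV's Haar cascade.
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("frame.jpg")                      # assumed example image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor realizes the image-pyramid search; minNeighbors merges
# overlapping candidate windows into final detections.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```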

Later, Dalal and Triggs [DT05] proposed to use Histograms of Oriented Gradients (HOG) and a linear Support Vector Machine (SVM) for the detection of pedestrians in low-resolution images. They showed that locally normalized HOG descriptors provide a powerful and robust feature set to reliably detect pedestrians, i.e. humans in upright positions, even in cluttered backgrounds, difficult lighting conditions, and various poses.
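In the same spirit, a minimal sketch of HOG-based pedestrian detection using OpenCV's built-in people detector (a pre-trained linear SVM); the image name and the window-stride/scale parameters are illustrative assumptions, not values from [DT05].

```python
# Hedged sketch: HOG + linear SVM pedestrian detection with OpenCV defaults.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.jpg")                     # assumed example image
# winStride and scale control the sliding-window / image-pyramid search.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h), score in zip(boxes, weights):
    print(f"person at ({x}, {y}, {w}, {h}), SVM score {float(score):.2f}")
```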

Although researchers use rigid object detectors mainly to localize faces, this class of object detectors has also been applied to detect other objects as well. Unfortunately, although rigid object detectors can be used to detect objects which are relatively rigid, the localization of deformable objects in natural scenes requires more sophisticated techniques because different body postures would result in different appearances of the same object.

Model-Based Detectors To cope with the above mentioned challenges, researchers later proposed model-based methods to reliably detect objects in visual footage. Different feature sets can be used to create models of an animal body such as appearance, texture, shape, colour or a combination of those.


Remarkable ideas for a generalized unsupervised object tracker and detector were presented by Ramanan and Forsyth in [RF03; RFB05; RFB06]. In [RF03] the authors propose a technique to automatically build models of objects from a video sequence where no explicit supervisory information is necessary and no specific requirements regarding the background are stipulated. Objects are modeled as 2D kinematic chains of rectangular segments where the topology as well as the number of segments are not known a priori. In a first step, candidate segments are detected using a simple low-level local detector by convolving the image with a template which corresponds to parallel lines of arbitrary orientations and scales. This step alone results in many false positive detections. Therefore, the resulting segments are clustered in a second step to identify object regions that are coherent over time. After pruning away remaining segments that do not fit the motion model because their tracks are too short or move too fast, a coarse spatial model of the remaining segments can be assembled. Because appearance models of objects are generated on the fly, there are two ways to think about the proposed system. It can either be seen as a generalized tracking system that is capable of modeling objects while tracking them, or as a source of an appearance model which can later be used to detect objects of the same kind.

One drawback of this approach is that when building the rough shape model of the object of interest one has to be certain that within a given sequence only one object is present in the scene. Furthermore, the resulting appearance model is very much tuned to the specific object category in the video sequence. Therefore, the same authors try to overcome these drawbacks by extending their work towards an object recognition framework to detect, localize, and recover the kinematic configuration of textured objects in real-world images [RFB05; RFB06]. Ramanan et al. fuse the deformable shape models learned from videos and appearance models of texture from labeled sets of images in an unsupervised manner for that purpose. Although the detection results improve significantly compared to their previous work, the detector is designed to only detect highly textured objects.

More recently, deformable part-based models (DPM) were introduced by Felzenszwalb et al. [Fel+10] as generic object detectors, achieving state-of-the-art results on a variety of object categories in international benchmarks such as the PASCAL VOC 2010 [Eve+10] at that time. Whilst the majority of methods for object detection depict the object of interest as a whole, the main idea of DPMs is to represent objects as flexible configurations of feature-based part appearances. Support Vector Machines (SVM) are used to learn the set of appearance detectors and part alignments. While performing very well on some object categories, DPMs are known to be sensitive to occlusion and highly deformable object categories such as animals. Parkhi et al. [Par+11] extended the ideas of DPM to distinctive part models (DisPM). They propose to initially utilize DPMs to detect distinctive regions in a first phase. Secondly, the whole object is subsequently found by learning object-specific features such as colour or texture from the initially detected image region. After a coarse foreground-background separation based on the learned low-level image features, graph cuts [BV01] are used for the final segmentation of the object.

A variety of other extensions of the basic DPM approach, including so-called grammar models, were introduced subsequently [GFM11]. The main idea of grammar models is that objects are represented in terms of other objects through compositional rules. Deformation rules allow the parts of an object to move relative to each other, leading to hierarchical deformable part models. Structural variability provides a choice between multiple part subtypes, effectively creating mixture models throughout the compositional hierarchy, and also enables optional parts. In this formalism, parts may be reused both within an object category and across object categories.


Local Keypoint Detectors A large and growing body of literature has investigated template matching, rigid object detectors as well as model-based approaches to detect objects in images or videos. However, they often perform poorly when detecting objects in real-world environments due to a wide variety of postures, lighting conditions and partial occlusion. Tiny regions on the surface of objects often exhibit less deformation than the entire object [KB13]. Therefore, new approaches for fast, robust, and reliable detection and description of regions or local interest points, such as the Harris Corner Detector [HS88], the Scale Invariant Feature Transform (SIFT) [Low04] or Speeded Up Robust Features (SURF) [Bay+08], have been developed in the past. Furthermore, most keypoint descriptors offer the possibility of subsequent matching, which can serve as a basis not only for object detection but for object recognition.
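A small illustrative sketch of keypoint detection and matching with SIFT follows; it assumes an OpenCV build (version 4.4 or later) that includes SIFT, and the image names and the ratio-test threshold are assumptions for illustration.

```python
# Hedged sketch: SIFT keypoints and descriptor matching between an object
# image and a scene image, with Lowe's ratio test to discard ambiguous matches.
import cv2

obj = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)    # assumed example images
scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(obj, None)
kp2, desc2 = sift.detectAndCompute(scene, None)

matcher = cv2.BFMatcher()
matches = matcher.knnMatch(desc1, desc2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} keypoint matches between object and scene")
```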

Although local keypoint detectors are capable of robustly detecting and describing points of interest in certain scenarios, they disregard important global information such as object structure, the spatial relationship between keypoints, and temporal information available in video sequences.

Deep-Learning-Based Detectors Recently developed deep learning techniques have revolutionized the field of computer vision and particularly object detection and classification. Especially Convolutional Neural Networks (CNN) have gained much attention in the computer vision community and have thus been used for a variety of different image and video processing tasks such as object detection and recognition as well as image classification, among others. CNNs are biologically-inspired variants of Multi-Layer Perceptrons (MLPs). Thus, a CNN is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they respond to overlapping regions tiling the visual field.

One of the first attempts to utilize CNNs for object detection, called Regions with CNN features (R-CNN), was published by Girshick et al. in 2014 [Gir+14]. The CNN localization problem is solved by operating within the “recognition using regions” paradigm, which means that the input image is divided into N × N overlapping blocks and for each region the system estimates the probability that it belongs to a certain class. The proposed object detection system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals over several scales, (3) computes features for each proposal using a large CNN, and (4) classifies each region using class-specific linear SVMs. The authors proposed a simple and scalable detection algorithm that improved mean average precision (mAP) by more than 30% relative to the previous best result on the PASCAL Visual Object Classes (VOC) Challenge, a widely known competition for object recognition algorithms. Although the proposed R-CNN algorithm achieved excellent detection accuracies, notable drawbacks can be found. Most notably, R-CNN is slow both at training and test time because it performs a CNN forward pass for each object proposal, without sharing computation.

Thus, a number of extensions have been proposed in the recent past to overcome these limitations. Fast R-CNN [Gir15], for instance, employs several innovations to improve training and testing speed while at the same time increasing detection accuracy. While R-CNN takes about 47 seconds per image on a GPU, Fast R-CNN is about 213 times faster. The speed-up is achieved by replacing the SVM with fully connected layers, so the architecture can be trained end-to-end. Furthermore, the fully connected layers are accelerated by compressing them with a truncated version of the Singular Value Decomposition (SVD) [Cai+14]. Fast R-CNN achieves near real-time rates using very deep networks when ignoring the time spent on region proposals, which actually turned out to be the bottleneck at test time.

To overcome this issue, Faster R-CNN was proposed in 2015 by Ren et al. [Ren+15]. The authors introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. Furthermore, RPN and Fast R-CNN are fused into a single network by sharing their convolutional features. The proposed detection system has a frame rate of 5 frames per second (including all steps) on a GPU, while achieving state-of-the-art object detection results.
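As a hedged, purely illustrative way to try such a detector (this is not the project's own implementation), the torchvision library provides a pre-trained Faster R-CNN that can be run on a single image as sketched below; the image name and the score threshold are assumptions, and the exact `weights` argument depends on the installed torchvision version.

```python
# Hedged sketch: inference with a pre-trained Faster R-CNN from torchvision.
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

# "DEFAULT" selects the library's pre-trained COCO weights (torchvision >= 0.13).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = Image.open("street.jpg").convert("RGB")      # assumed example image
tensor = F.to_tensor(img)

with torch.no_grad():
    # The model returns, per image, bounding boxes, class labels and scores.
    output = model([tensor])[0]

for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if float(score) > 0.8:                          # assumed confidence threshold
        print(int(label), [round(float(v), 1) for v in box], float(score))
```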

In the same year, a unified, real-time capable object detection framework named YOLO (You Only Look Once) was presented in [Red+15]. Prior work on object detection repurposes classifiers to perform detection. Within YOLO, in contrast, object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Compared to previous deep-learning frameworks for object detection such as R-CNN and its variants, YOLO is extremely fast. The base YOLO model processes images in real time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mean average precision of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.

Motion-Based Detectors Although the above mentioned approaches achieve sufficient results under certain conditions, they do not take full advantage of the spatio-temporal information in video sequences to detect moving objects. Based on the assumption that highly occluded or even camouflaged objects can be detected by taking their movement in front of a relatively stationary background into account, a considerable amount of literature has been published on motion-based detectors. Object detection can be achieved by building a representation of the scene, called the background model, and then finding deviations from the model for each incoming frame. Any significant change in an image region from the background model signifies a moving object. The pixels constituting the regions undergoing change are marked for further processing. This process is referred to as background subtraction.

The simplest moving-object detectors are based on frame differencing, a pixel-wise subtraction between two or three consecutive frames in an image sequence to detect regions corresponding to moving objects such as humans and vehicles. Although frame differencing is very adaptive to dynamic environments, since only up to three consecutive frames are considered for background modeling, holes often develop inside moving entities, and false positive detections often arise for highly dynamic backgrounds or illumination changes which occur in real-life scenes. Furthermore, it is often unclear how the threshold for foreground-background subtraction can be set efficiently.
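A minimal sketch of frame differencing over a video file; the file name and the binarization threshold are assumptions and illustrate exactly the thresholding difficulty mentioned above.

```python
# Hedged sketch: moving-region detection by differencing consecutive frames.
import cv2

cap = cv2.VideoCapture("sequence.mp4")             # assumed example video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pixel-wise absolute difference; pixels above the threshold are marked
    # as belonging to moving regions.
    diff = cv2.absdiff(gray, prev_gray)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # assumed threshold
    print("changed pixels:", cv2.countNonZero(mask))
    prev_gray = gray
cap.release()
```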

To overcome these limitations, a vast variety of more sophisticated background modeling techniques has been developed over the past two decades, such as Gaussian Mixture Models (GMM) [SG99], Adaptive Poisson Mixture Models [FGS11], Wave-Back [PW05], CodeBook [Kim+04], Video Background Extraction [BD11], and Robust Principal Component Analysis (RPCA) [Can+11], just to name a few. For a recent in-depth overview of traditional as well as recent approaches in background modeling for foreground detection, the reader is referred to the survey by Bouwmans et al. [Bou14].

Statistical approaches such as GMMs, for instance, utilize statistics of the distribution of pixels in each frame to distinguish moving objects from the constant background. Because the Gaussian distribution updates itself when a new sample comes in, periodic movements of branches or trees can be characterized by the model. Because such a technique alone would result in too many false positive detections due to non-periodic motions, temporal difference filtering is often used to estimate the velocity of detected objects in adjacent frames in order to improve the results.
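For illustration, OpenCV's MOG2 background subtractor implements a per-pixel Gaussian mixture model of this kind; the file name and parameter values below are assumptions, not project settings.

```python
# Hedged sketch: GMM-style background subtraction with OpenCV's MOG2.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
cap = cv2.VideoCapture("sequence.mp4")             # assumed example video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each pixel's Gaussian mixture is updated with the new frame; pixels that
    # do not fit the background model are returned as foreground.
    fg_mask = subtractor.apply(frame)
    print("foreground pixels:", cv2.countNonZero(fg_mask))
cap.release()
```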


In 2011, a foreground-background separation technique based on Robust Principal Component Analysis (RPCA), an algorithm based on Compressed Sensing (CS) theory, a technique for signal measurement and reconstruction, was developed by Candès et al. in [Can+11]. RPCA splits each frame of a video sequence into a low-rank matrix L which contains pixels of the background and a sparse matrix S representing the foreground activity of moving objects. Again, since this procedure still results in a relatively high number of false positive detections, a post-processing step such as the Large Displacement Optical Flow algorithm of [BM11], which detects large changes of velocity by incorporating motion information, is necessary for applications operating in real-world scenarios.


3 Temporal Video Segmentation

3.1 Executive Summary

The purpose of temporal video segmentation is the automatic separation of a (usually long) sequence of images (video / movie) into meaningful parts. This separation is useful as a preprocessing step for different kinds of further automatic processes as well as for visualisation and annotation processes conducted for and by humans. The simplest kind of separation splits the video at the boundaries produced by the editing process. The data produced by such algorithms (timestamps, but also e.g. thumbnails of the boundary frames) is directly consumable by humans. Visualisations such as timelines accompanied by boundary or key frames can further enhance these results, making them even more useful.

The following sections briefly review some approaches to automatic video segmentation. Some of them are already available, while others would need further research and adaptation when required by I-Media-Cities.

3.2 Methods of Temporal Video Segmentation

The increasing amount of multimedia information leads to a greater need for efficient ways to retrieve and to search the content of interest. Temporal video segmentation (TVS) can be considered a first step towards automatic annotation of digital video for browsing and retrieval [KC01], in order to automatically extract the structure of videos. TVS comprises a number of video analysis methods for temporal video feature extraction, e.g. key frame extraction, shot similarity detection, or shot structure analysis in order to analyse the frequency of consecutive shot boundaries. All these methods can be used for automatic video annotation, browsing, and retrieval. TVS also has a high relevance in the field of dramatic composition of feature films [RS03], [Lie01]. The Text REtrieval Conference Video Retrieval Evaluation (TRECVid1) encourages research in information retrieval and provides a large collection of test data, uniform scoring, and the opportunity to compare results [SOK06]. TRECVid also provides, amongst others, a shot boundary detection tool for evaluation purposes and comparison to other work in this domain. In the following, the most common sub-domains of temporal video segmentation are introduced.

Video Shot Detection Commencing with the extraction of temporal information by detecting video shot2 boundaries, further temporal and non-temporal video analysis can be applied to the enclosed video segment defined by the detected shot boundaries. Video shot detection provides the basis for existing video segmentation methods and can be considered as the initial step for realizing further video analysis tasks [Ham09]. Since video shot detection is a very meaningful task in the TVS domain, a lot of work following different approaches, e.g. pixel-based, feature-based, histogram-based, statistics-based, transform-based, motion-based methods, and combinations of them [Nap+98], [Cha+08], [Ham09], has been done. A survey on shot boundary detection covering the related work of the annual TRECVid can be found in [SOD10]. [CGP05] gives a concise overview of features and methods for shot boundary detection. Since video filters or cut effects are used in common post-production processes and thus characterize the composition of video segments, automatic video shot detection followed by the detection of gradual transitions is a very meaningful tool, especially for the feature film domain.

1 http://trecvid.nist.gov
2 A shot can be regarded as a continuous sequence of video frames without any interruption, recorded by a single camera.
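As an illustrative sketch of one of the histogram-based approaches mentioned above, the following snippet flags hard cuts where the colour-histogram correlation between consecutive frames drops; the video name and the correlation threshold are assumptions, and gradual transitions would need more elaborate handling.

```python
# Hedged sketch: simple histogram-based hard-cut (shot boundary) detection.
import cv2

cap = cv2.VideoCapture("film.mp4")                 # assumed example video
prev_hist = None
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
    cv2.normalize(hist, hist)
    if prev_hist is not None:
        # Low correlation between consecutive frame histograms indicates a cut.
        similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
        if similarity < 0.5:                       # assumed threshold
            print(f"possible shot boundary at frame {frame_idx}")
    prev_hist = hist
    frame_idx += 1
cap.release()
```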


Gradual Transition Detection Similar to shot boundary detection, transitions are detected by comparing two consecutive video frames with a distance function in order to measure the similarity between a frame pair, followed by defining thresholds to make a decision [Bes+05]. Further work deals with training processes followed by a classification step, using neural networks [LZ01] or a probability-based algorithm [Han02]. Progressive or gradual transitions can be separated into different types, e.g. dissolves, fades, and wipes. In addition, a lot of research has been done using colour histogram-based, standard deviation-based, and feature-based methods [ZMM99], [Lie01], [GC07] in order to detect transition types in videos.

Key Frame Extraction A key frame can be considered as a selected video frame that represents a video segment depending on its content and with a high visual semantic importance. A simple key frame can be the first, last or an arbitrary frame of a shot, however without the consideration of visual semantics. The size of a key frame set can be fixed a priori, left unknown a posteriori, estimated from the level of visual change, or determined in order to find the appropriate number of key frames before the full extraction is executed [TV07]. In the literature, automated detection of key frames is described in different ways: [SA07] uses MPEG-7 colour descriptors [Ohm+03] and texture features locally extracted from key frame regions. Each frame is described in terms of higher semantic features using a hierarchical clustering approach. [SG10] and [HCL04] introduce methods in the compressed video domain. Further work on key frame extraction methods is summarized in [Hu+11].
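A minimal sketch of the simplest strategy mentioned above, picking one representative frame (here the middle frame) per shot; the function, its arguments and the example boundaries are hypothetical and assume shot boundaries coming from a prior detection step.

```python
# Hedged sketch: naive key frame extraction (middle frame of each shot).
import cv2

def extract_key_frames(video_path, shot_boundaries):
    """shot_boundaries: list of (first_frame, last_frame) tuples per shot."""
    cap = cv2.VideoCapture(video_path)
    key_frames = []
    for start, end in shot_boundaries:
        middle = (start + end) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, middle)
        ok, frame = cap.read()
        if ok:
            key_frames.append((middle, frame))
    cap.release()
    return key_frames

# Example usage with hypothetical shot boundaries:
# frames = extract_key_frames("film.mp4", [(0, 120), (121, 300)])
```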

Shot Similarity and Scene Detection Scene detection is based on the fact that video shots belonging to the same scene have strong visual semantic relations and can therefore be grouped into a high-level story unit (scene). Since a scene is a complex and complicated concept, information about film dramaturgy and film-production techniques would be helpful in order to adapt these clues to a scene detection algorithm. Scene detection is handled differently in previous research work: [RS03] uses a key frame comparison method considering a similarity measure of a given shot with respect to previous shots. [TZ04] deals with clustered shots based on background similarity by extracting visual features from selected regions of key frames. [TVD03] uses multi-resolution edge detection followed by a neighbourhood visual coherence measure at shot boundaries, which in turn are estimated by taking the similarity of colours into account. A broad survey on scene segmentation and other issues important to video analysis and video retrieval can be found in [Hu+11].


4 Video Motion Estimation

4.1 Executive Summary

The purpose of video motion estimation methods and algorithms is to estimate the motion of and within a scene of moving images. We can group these methods based on different criteria, as Figure 2 shows. Applications for video motion estimation comprise, but are not limited to:

• Video retrieval: Video retrieval aims at automatically finding videos which originate from the same source but have been modified in some way (re-coding, spatial or temporal transformation, colour changes etc.). Motion estimation methods provide compact representations (features) of video segments that are sufficient to robustly identify them. The features provided by camera motion estimation are not directly usable by a human in terms of their semantics, since they usually consist of a lot of data that only contains the feature coefficients. In I-Media-Cities, video retrieval (a.k.a. video identification / fingerprinting) is not considered at this early project stage, as we already motivated in section 1.

• Video segmentation: We already dedicated the previous section 3 to methods of temporal video segmentation. We can also utilise the motion features used for video retrieval for segmentation purposes. Instead of finding similarity among different videos, we use the features to detect dissimilarities (i.e. shot boundaries) within the same video. The semantic level of the features is therefore the same as in the video retrieval case and not directly usable by humans unless further processing is applied.

• Object tracking: Tracking the moving objects of a scene is a huge field of computer vision research and covers many different applications in its own right. Examples are surveillance, robotics, human-computer interaction or medical imaging. Instead of the camera motion3, the motion and trajectories of objects within the scene are retrieved. The results of such algorithms are of a higher semantic level and usually describe the motion of a simplified, uniquely identified object representation (e.g. point, rectangular region, contour) over time. It is easy to visualize such results in a video and therefore they are directly understandable, usable and extendable by humans. Object tracking is closely connected to object detection algorithms as reviewed in section 2. We cover this topic here only briefly by giving some references in the next section, because object detection and classification seem to have higher priority in IMC at the time of writing.

• Video event detection/classification: Motion estimation can also be used to detect specific events within a video. The types of events can range from simple events such as “the camera has moved” to complex events such as “a person just gave another person a suitcase”. The more complex the event, the more difficult and error-prone the detection method becomes. In I-Media-Cities, video event detection might become important as a pre-processing step for semi-automatic annotation of specific events. Especially when artistic camera work is the subject of interest, pre-processing large amounts of video can help to locate specific camera motion patterns that can then be further annotated by a human expert. Another application case could be the classification of video scenes in terms of visual video rhythm (e.g. harmonic vs. fast-paced). However, it is important to know that the higher the semantic level of classification, the more difficult it is for an algorithm to produce correct classifications. Trying to automatically classify concepts where even humans tend to disagree, or where even a single individual changes classifications depending on mood or context, will most likely fail. The results of video event detection or classification are directly understandable and usable by humans. They will either be highlighted or heat-mapped spots on a timeline, or simply the resulting classes for a video or parts thereof.

3 Though global motion estimation is often done in parallel in order to compensate for it before tracking objects.

Figure 2: High-level taxonomy of motion estimation methods.

4.2 Local vs. Global Motion

In movies, documentaries, newscasts as well as non-professionally produced films, two major types of motion occur: local motion and global motion. Motion analysis algorithms aim to detect and quantify these kinds of motion. Local motion is characterized by moving objects in the scene. Depending on the type of object, the type of motion can differ significantly. Important types of objects are e.g. persons, vehicles, things and the environment itself (trees etc.). While the former are often in the focus of the scene, the latter are more in the background. Global motion, in contrast, occurs due to the motion of the camera. It can consist of rotational (pan, tilt, roll) as well as translatory (x, y, z) motions and zoom (change of the focal length of the camera lens). Basically, a scene can contain all these kinds of motion in a combined fashion, making it more difficult to detect them.

There exist different levels of video motion estimation. Some methods detect the motion based on simple (local) image features and average over the whole image, some try to distinguish between local motion and global motion. Full-fledged approaches try to create a 3D scene and camera model by analysing the motion over time (monoscopic footage) or spatially (stereoscopic footage). These methods can work both ways: unsupervised (i.e. no training is required) or using a classification approach that needs annotated training data.

In this deliverable we focus on methods for global motion estimation, since these seem to be most important for I-Media-Cities, and briefly review the most commonly used approaches. These methods could support the detection of scene events with respect to camera movements. They could also support the classification of scenes in terms of artistic camera work.

From what we learned at the very first project meetings, we do not consider local motion detection (i.e. object/video tracking) of such high importance for IMC. However, object tracking almost always uses methods for object detection, of which we already gave an extensive review in section 2. These could then be extended by object tracking techniques, of which, according to [YJS06], there are Point Tracking methods (e.g. Kalman Filter [BC86], Greedy Optimal Assignment etc. [VRB01]), Kernel Tracking methods (e.g. KLT [ST94], mean-shift tracking [CRM03], or more recently adaptive tracking-by-detection methods such as in [HST11], reviewed in [WLY13]) and Silhouette/Shape Tracking methods (e.g. Hough transform based [SA04] or histogram based [KCM04]).

4.3 Methods for Global Motion Estimation

Global motion estimation often relies on the calculation of the so-called “optical flow” for a temporal sequence of images. The optical flow describes the apparent motion of parts of a scene caused by relative motion between that scene and the camera. Since the 3D scene has been projected onto the 2D camera sensor, information about that motion has been lost (aperture problem) and estimation of the flow is therefore an ill-posed problem. Flow estimation algorithms try to solve this problem by introducing additional conditions for estimating the actual flow. Early work on this topic and methods for its calculation can be found in [HS81; LK81; BA96; SC97]. A first evaluation was made by Barron in 1994 [BFB94]. While some of the algorithms aim at calculating the flow sparsely, for robust key points of the scene only4, others reconstruct a flow vector for each pixel of the image (dense flow). A framework for objective comparison of the latest flow estimation algorithms has been described in [Bak+11]. An example of motion vectors for translatory and rotational camera motion is shown in Figure 3.

For global camera motion estimation, sparse, robust flow or motion vectors often suffice to obtain the motion vectors for the scene. These are then fed into a further processing step in order to estimate the camera motion qualitatively (e.g. classifying it into the motion classes pan, tilt, roll, zoom). The method described in [SVH97] uses the residual flow in order to estimate pan, tilt, roll and zoom. [SL96] distinguishes between singular and non-singular flow, while in [XL98] Xiong and Lee directly derive the motion from the flow by analysing it in different regions of the image. Zoom detection by calculating the focus of expansion is described in [JhYs08]. In [Gen+06] the authors use a machine learning approach with a Support Vector Machine in order to classify the camera motion into pan/tilt and zoom/roll.

4 That is, points of the scene that are clearly distinguishable from their surroundings and thus well suited for tracking.
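As a hedged sketch of such a pipeline, the snippet below tracks sparse key points between two consecutive frames with pyramidal Lucas-Kanade flow, fits a global similarity transform, and derives a rough qualitative camera-motion label. It is a simplified 2D approximation of pan/tilt/roll/zoom rather than any of the cited methods, and the frame file names and thresholds are assumptions.

```python
# Hedged sketch: sparse flow + global similarity fit for coarse camera motion.
import cv2
import numpy as np

prev = cv2.imread("frame_0100.png", cv2.IMREAD_GRAYSCALE)  # assumed frames
curr = cv2.imread("frame_0101.png", cv2.IMREAD_GRAYSCALE)

# Track robust key points (corners) from the previous to the current frame.
pts_prev = cv2.goodFeaturesToTrack(prev, maxCorners=400, qualityLevel=0.01,
                                   minDistance=8)
pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts_prev, None)
good_prev = pts_prev[status.ravel() == 1]
good_curr = pts_curr[status.ravel() == 1]

# Fit a global similarity transform (translation, rotation, scale) to the flow.
M, _ = cv2.estimateAffinePartial2D(good_prev, good_curr)
tx, ty = M[0, 2], M[1, 2]
scale = np.hypot(M[0, 0], M[1, 0])
angle = np.degrees(np.arctan2(M[1, 0], M[0, 0]))

# Assumed thresholds for a coarse qualitative classification.
if abs(scale - 1.0) > 0.01:
    print("zoom, scale factor", round(float(scale), 3))
elif abs(angle) > 0.2:
    print("roll, angle", round(float(angle), 2))
elif abs(tx) > 1.0 or abs(ty) > 1.0:
    print("pan/tilt, translation", round(float(tx), 1), round(float(ty), 1))
else:
    print("static camera")
```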


Figure 3: Motion/flow vector visualisation for (a) a camera pan, (b) a camera tilt, (c) a camera roll and (d) a camera move in the z direction.


5 Video Quality Estimation

5.1 Executive Summary

During film production related processes, like recording, post-production, archiving, digitization, distribution format change, transmission, etc., the audio-visual (a/v) essence may be changed in such a way that a degradation of the video quality (VQ) would be expected. Apart from the disturbed visual perceptibility, the degraded material including errors (such as analogue and digital artefacts, interferences, interruptions or losses, etc.)5,6 would not represent the video content correctly and would thus make further video analysis approaches, as described in the sections on Object Detection and Classification, Video Motion Estimation, and Temporal Video Segmentation, more difficult or impossible. Furthermore, VQ estimation may help to cross-check incorrect metadata which has been set wrong unintentionally or has been unknown, e.g. during digitization processes or post-production. In order to detect errors in video content there is a need to inspect the degraded material. This is mostly still done manually, since automated VQ estimation methods do not have the required accuracy. However, automated VQ estimation approaches may specifically support the manual, cost-intensive and time-consuming quality checks.

VQ estimation can be separated into two basic approaches: (1) VQ estimation with a reference video (original material, usually before encoding) and (2) VQ estimation without any reference. In the further course of the text we use the term quality control (QC) for VQ estimation. In I-Media-Cities we need to apply a no-reference QC approach, since for most of the material the original source has been an analogue video or even celluloid film. An overview of recent no-reference QC methods is presented in [Sha+14]. The I-Media-Cities partner Fraunhofer IDMT already provides tools for the detection of visual errors such as (but not limited to): vertical and horizontal black bars, vertical and horizontal picture shifts, field order detection, blockiness of a video, interlace artefacts, video freezes, black frames, blur, ringing, halo, over/under exposure, visual noise, and camera unsteadiness. These tools can be extended or adapted to the needs of the project.

Typically, QC can be applied to digital video formats in different ways: container-based (wrapped), by checking the integrity of the container; data stream-based (coded); or base-band (decoded, on video pixel level) analysis. Error detection for stream-based analysis is performed after the available information is extracted from the header. Base-band analysis is performed on decoded video frames considering intra-frame or inter-frame processing. Base-band approaches are applicable to detect video encoder artefacts, such as blockiness, visual noise, or blurring, but also artefacts which have been caused by various other issues, e.g. analogue errors or digitization problems. A lot of work has been done using no-reference approaches in order to check digital a/v content for different errors.

The result of QC is basically a (normalized) scalar value per video frame representing “how much” a specific error occurs in that frame. While this is not directly consumable by a human, visualizations such as a heat map over time can significantly help to understand the results. Since quality is a very subjective term, it is difficult to quantify it with automatic algorithms without considering the targeted audience, the application context and the material that is analysed.

5 EBU, Strategic Programme on Quality Control; last visit: 12 July 2016.
6 Wiki page: A/V Artefact Atlas; shows a list of analogue and digital a/v artefacts; last visit: 13 July 2016.


5.2 Quality control approaches

Methods of digital signal analysis can be applied to digital a/v data in order to detect analogue and digital errors. Subsequently, a collection of essential error detection methods is presented, which, however, does not comprise all research in the QC domain.

5.2.1 Production related video errors

Lighting Imbalance Over- or under-exposures are non-linear errors caused by unbalanced brightness due to an incorrect orientation of the camera lens with respect to the ambient lighting. The authors in [YK02] introduce a lighting imbalance detection method. First, the average, minimum and maximum luminance of a video segment is computed, followed by a decision for over- and under-exposed video segments using thresholds which define the borders between over- and under-exposed video content.
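A minimal sketch of such a threshold-based exposure check on a single frame; the threshold values are illustrative assumptions and not those used in [YK02], and a segment-level decision would aggregate this per-frame result over time.

```python
# Hedged sketch: over-/under-exposure classification from average luminance.
import cv2
import numpy as np

def exposure_label(frame_bgr, low=40.0, high=215.0):
    """Classify a frame from its mean luminance; low/high are assumed thresholds."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    mean_luma = float(np.mean(gray))
    if mean_luma < low:
        return "under-exposed", mean_luma
    if mean_luma > high:
        return "over-exposed", mean_luma
    return "balanced", mean_luma

# Example usage:
# label, value = exposure_label(cv2.imread("frame.jpg"))
```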

Film Unsteadiness Possible causes of film unsteadiness are production-related recording, but also defective or worn-out perforation of the film material, or a faulty film transport mechanism. Film unsteadiness can be detected through changes of global motion (see also section 4 on Video Motion Estimation) in successive video frames and compensated using global motion vectors [For09]. Further methods to detect camera shakes or unsteady camera movements have been proposed. The authors of [YK02] define a video as shaky if most objects in the video frames move back and forth repeatedly along the same directions during a short period. In order to detect camera shakes, a trajectory of shaking vectors, each relating to a defined region inside a video frame and within a short video sequence, is calculated. If the angles of successive vectors change with a high frequency (e.g. 20 times within a video sequence of 30 frames), the video segment is considered shaky. Another method is described in [Tak+03]. A camera shake is one of four types which have been taken into consideration by this approach. Thereby, camera motion parameters are automatically extracted from MPEG video sequences in order to estimate the camera motion transitions and their ratio to each other.

Blooming and Smearing Such artefacts are generated by light sources, such as a glaring object, causing the amount of charge on a charge-coupled device (CCD) sensor to exceed its capacity, which most often still happens with older video cameras. The authors in [CSP13] describe a method for glare region detection which, amongst others, relates to blooming and smearing artefacts. In a first step, they use a bilateral filter to globally reduce visual noise and create layers by quantizing the image, with the result that false contours appear around light sources. Finally, blooming and smearing are detected by quantifying the similarity of a shape between neighbouring layers through a distance transform.

Noise Noise is an unwanted effect that most often occurs in cameras that use CCD and CMOS sensors. A fast impulsive noise detection method is introduced in [SC05]. Pixels within a radius r around a defined pixel P(i,j) are considered. The intensity and distance of a pixel pair are calculated and compared using thresholds and distance functions in order to classify a noisy pixel. The authors in [LTO12] introduce a patch-based Gaussian noise detection method. To this end, they use Principal Component Analysis (PCA) on weakly textured images to estimate the noise.

De-Interlacing and Field Order Problems De-interlacing artefacts occur when two interlaced sub-fields are combined into a single progressive frame, whereas field order problems can occur if interlaced video shots with different field order are assembled, which in turn can affect the video negatively through unwanted effects (e.g. “flashy fields”). The authors of [KPL01] introduce an algorithm for separating interlaced videos from progressive content. The authors of [Bay07] present an efficient approach to detect the field order of interlaced videos using motion magnitude information.
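As an illustration of the underlying idea (not the algorithms of [KPL01] or [Bay07]), a very simple per-frame “combing” measure compares differences between adjacent rows, which belong to opposite fields, with differences between rows two apart, which belong to the same field:

import numpy as np

def combing_score(gray):
    # Interlaced frames with inter-field motion show much larger differences
    # between adjacent rows than between rows of the same field; scores well
    # above 1 therefore suggest interlaced content.
    img = gray.astype(np.float32)
    adjacent = np.mean(np.abs(img[1:, :] - img[:-1, :]))    # opposite fields
    same_field = np.mean(np.abs(img[2:, :] - img[:-2, :]))  # same field
    return float(adjacent / (same_field + 1e-6))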

5.2.2 Analogue errors

In the case of former analogue film productions the measured technical quality of the a/v signal had a direct effect on the perceived quality of the a/v content. A gradual loss of the a/v signal, e.g. caused by transmission interruption or generation loss due to physical effects, often leads to a degradation of the a/v quality.

Tracking error or mis-tracking Amongst others, the tracking error is the most common defect when video heads cannot correctly follow the video tracks recorded on a tape. A detection and correction method for tracking errors is proposed in [SAM13].

Blotches and Scratches A blotch is an artefact typically occurring as a bright or dark patch on old celluloid films. Scratches in film material often appear as vertical lines across the image; they result from dust particles in projectors which scratch the surface of the celluloid film during playback. Both types of distortion can be detected by analysing temporal correlations between successive frames using optical flow computations. Such a detection approach is proposed in [Gro06].
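A sketch of that idea under simple assumptions (dense Farnebäck optical flow as implemented in OpenCV, an illustrative difference threshold): a pixel that disagrees with both of its motion-compensated temporal neighbours is a blotch candidate. This is only the detection idea, not the full method of [Gro06]:

import cv2
import numpy as np

def blotch_candidates(prev, cur, nxt, diff_thresh=40):
    # prev, cur, nxt: consecutive 8-bit greyscale frames.
    h, w = cur.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))

    def align(ref):
        # Estimate dense flow from the current frame to the reference frame
        # and warp the reference so that it is aligned with the current frame.
        flow = cv2.calcOpticalFlowFarneback(cur, ref, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        map_x = (grid_x + flow[..., 0]).astype(np.float32)
        map_y = (grid_y + flow[..., 1]).astype(np.float32)
        return cv2.remap(ref, map_x, map_y, cv2.INTER_LINEAR)

    d_prev = cv2.absdiff(cur, align(prev))
    d_next = cv2.absdiff(cur, align(nxt))
    return (d_prev > diff_thresh) & (d_next > diff_thresh)  # blotch candidate mask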

Colour Fading A colour fading artefact is caused by a temporal change or loss of one or all colour components. Since film material consists of multiple layers, the outer layer is very sensitive. The non-linear loss of colour can be detected by observing the properties of each colour layer [RG01].
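A very crude indicator of a dominant colour cast, which can hint at a faded dye layer, is a grey-world-style comparison of the colour channel statistics (this is only an illustration, not the method of [RG01]):

import numpy as np

def colour_cast_ratio(frame_rgb):
    # frame_rgb: 8-bit colour frame (H x W x 3). Returns the per-channel means
    # and the ratio of the strongest to the weakest channel; ratios far above 1
    # indicate that one colour layer dominates, e.g. because another has faded.
    means = frame_rgb.reshape(-1, 3).astype(np.float64).mean(axis=0)
    return means, float(means.max() / (means.min() + 1e-6))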

Intensity Flicker Intensity flicker is an artefact that appears as temporal brightness variations between successive video frames. In the literature, flicker detection approaches using motion vectors are introduced in [For09] and [Zha+09].
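A simple sketch of a flicker indicator that does not use motion vectors: the mean luminance per frame is detrended with a moving average, so that slow lighting changes and fades do not count as flicker, and the remaining fluctuation is measured:

import numpy as np

def flicker_score(frames, window=5):
    # frames: iterable of greyscale frames; window is an illustrative length
    # for the moving average used to remove slow brightness trends.
    means = np.array([f.astype(np.float32).mean() for f in frames])
    kernel = np.ones(window, dtype=np.float32) / window
    trend = np.convolve(means, kernel, mode="same")
    return float(np.std(means - trend))  # higher values indicate stronger flicker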

5.2.3 Transmission errors

Freezes Frame freezes can be caused by the transmission of a video stream over an error-prone channel. For video broadcasting, for example, the compressed video stream is mostly transferred over a packet-switched network; packets can be lost or delayed to the point where they are not received in time for decoding. A no-reference approach deals with the identification of frozen frames: potential frozen frames are identified based on the mean-squared error (MSE) between the current frame and the previous frame. However, it is hard or even impossible to distinguish content that is intentionally not moving from content that is not moving as a result of a video impairment (freeze). Frame freezing can also be caused by low-bit-rate encoding in the case of an encoder buffer overflow, e.g. when the complexity of the video content suddenly increases [HTG09].
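A minimal sketch of the MSE-based freeze check described above; the threshold is illustrative and has to stay slightly above zero because sensor noise and dithering keep repeated frames from being bit-identical:

import numpy as np

def frozen_frame_flags(frames, mse_thresh=0.5):
    # frames: iterable of greyscale frames; returns one boolean per frame that
    # marks frames (near-)identical to their predecessor. Note that genuinely
    # static content also produces low MSE values (cf. the limitation above).
    flags, prev = [], None
    for f in frames:
        cur = f.astype(np.float32)
        if prev is None:
            flags.append(False)  # the first frame cannot be a repeat
        else:
            flags.append(float(np.mean((cur - prev) ** 2)) < mse_thresh)
        prev = cur
    return flags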


Packet loss Packet loss may happen occasionally or in bursts. Both can result in video impairments and hence significantly degrade the quality of experience (QoE) of video services [DL10]. The packet-loss detection algorithm proposed in [Tes+10] is based on detecting particular “sharp” horizontal edges related to block-wise artefacts of different size and proportion. The packet-loss measure for a particular decoded video sequence segment is determined as the square root of the weighted square sum of the artefacts detected in each frame separately.
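The following is only a rough indicator in that spirit, not the algorithm of [Tes+10]: it counts macroblock-aligned rows that contain an unusually large number of sharp vertical discontinuities, with an illustrative threshold and block size:

import numpy as np

def packet_loss_edge_score(gray, grad_thresh=40, block=16):
    # gray: 2-D greyscale frame. Slice losses typically produce long, sharp
    # horizontal edges that line up with macroblock row boundaries.
    img = gray.astype(np.float32)
    dv = np.abs(img[1:, :] - img[:-1, :])     # difference between row y and y+1
    strong = dv > grad_thresh
    row_idx = np.arange(dv.shape[0])
    aligned = (row_idx + 1) % block == 0      # boundaries between macroblock rows
    if not aligned.any():
        return 0.0
    # A row is suspicious if more than half of its pixels show a sharp edge.
    suspicious = strong[aligned].mean(axis=1) > 0.5
    return float(suspicious.mean())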

5.2.4 Video compression artefacts

Video compression artefacts Major digital video standards, e.g. H.261, MPEG-1, MPEG-2/H.262, H.263, MPEG-4 Part 2 and H.264/MPEG-4 AVC, rely on linear block transforms such as the discrete cosine transform (DCT). These video codec formats are based on transform blocks (macroblocks) of 8x8, 16x16, or other sizes7 of luma samples. Early standards used motion compensation on entire macroblocks and the transform block size was always 8x8; newer formats provide enhanced prediction accuracy [STL04].

The most common compression artefact is blockiness (see Figure 4), which occurs particularly when coding at lower bit rates [Vla00]. Blockiness artefacts in images or video frames can be detected in different ways. The authors of [TG00] describe a no-reference blockiness detection method for MPEG-2 coded video using the amplitude and phase of each horizontal harmonic derived from 32x32 blocks. A similar approach comparing the ratios obtained from the harmonics after horizontal and vertical blockiness artefact detection is introduced in [Tri10]. A blind blockiness measurement algorithm to detect and evaluate the power of a blocky signal is described in [WBE00]. A local no-reference blockiness metric that can automatically and perceptually quantify blocking artefacts of DCT coding is presented in [LH08]. Further blockiness measurement approaches for H.264-encoded videos are presented in [AR10] and [Lop+13].
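As an illustration of the common principle behind these no-reference metrics (not a re-implementation of any of the cited methods), a crude blockiness score compares the average gradient at assumed 8x8 block boundaries with the average gradient inside the blocks:

import numpy as np

def blockiness_score(gray, block=8):
    # gray: 2-D greyscale frame; block is the assumed transform block size.
    # Strong blocking raises gradients exactly at block boundaries relative
    # to gradients inside the blocks, so scores well above 1 indicate blockiness.
    img = gray.astype(np.float32)
    dh = np.abs(np.diff(img, axis=1))  # gradients between neighbouring columns
    dv = np.abs(np.diff(img, axis=0))  # gradients between neighbouring rows
    col_b = (np.arange(dh.shape[1]) + 1) % block == 0
    row_b = (np.arange(dv.shape[0]) + 1) % block == 0
    h_ratio = dh[:, col_b].mean() / (dh[:, ~col_b].mean() + 1e-6)
    v_ratio = dv[row_b, :].mean() / (dv[~row_b, :].mean() + 1e-6)
    return float(0.5 * (h_ratio + v_ratio))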

Methods to detect further compression artefacts, such as ringing and blurring, are described in [BS05], [LKH08], [LKH10a] and [LKH10b]. The coding errors mentioned above can also be detected in still images, since the information of a single frame can be sufficient for their detection.

Figure 4: Video error example, showing a compressed image (left) and blockiness artefacts detected by Fraunhofer IDMT's QC software tools (right).

7 Macroblock transform: Prediction partitions can have seven different sizes, such as 4x4, 8x8, 16x16, 8x4, 4x8, 8x16, and 16x8.


5.3 I-Media-Cities related tasks

The tasks in I-Media-Cities are twofold:

1. The technical measures always need to be related to a subjective perception of quality, which often depends on the application case or even on a specific group of persons. How the algorithms have to be parametrized so that their results reflect this quality perception needs to be researched. Therefore, appropriate video material is required as input for our QC algorithms, which need to be evaluated on the project-specific data.

2. Due to the specific nature of the content used in I-Media-Cities (old films and videos), new algorithms might be required or existing methods need to be improved in order to detect properties and errors and to address issues specific to such content (e.g. digitization issues, black-and-white film, projector flickering, frame rate issues, noise, etc.).


6 Conclusion

We have reported the state of the art for four topics of moving image analysis that are relevant to I-Media-Cities. Due to the heterogeneous audience of this document, we aimed at explaining and referencing basic and recent research in these fields without complex technical formulation. The document is a starting point within the project. It is intended to bring the consortium closer to the field of computer vision algorithms, to what they can and cannot achieve, and to what output is to be expected from them. We also explained by examples which semantic level of annotation is produced by some of the methods. While some produce data on a high semantic level (classification, object regions, shot boundaries), others produce data that require further processing or a human in the loop in order to bridge the semantic gap.

This document feeds into the requirements process that started with the kick-off meeting in Brussels at the end of May 2016. We will continue to work on task 7.1, i.e. practically evaluating algorithms and running first tests, until the topics have been narrowed down with respect to the requirements process. We will describe, update and detail the findings of D7.1 in D7.2 and D7.3 (M24), depending on their tests, training results and the needs of the project.


References

[AR10] Leonardo Abate and Giovanni Ramponi. “Measurement of the blocking artefact in video frames”. In: Melecon 2010 – 2010 15th IEEE Mediterranean Electrotechnical Conference. IEEE. 2010, pp. 218–223.

[BA96] Michael J. Black and P. Anandan. “The Robust Estimation of Multiple Motions: Parametric and Piecewise-Smooth Flow Fields”. In: Computer Vision and Image Understanding 63.1 (1996), pp. 75–104. DOI: 10.1006/cviu.1996.0006.

[Bak+11] Simon Baker et al. “A Database and Evaluation Methodology for Optical Flow”. In: International Journal of Computer Vision 92.1 (2011), pp. 1–31. ISSN: 0920-5691. DOI: 10.1007/s11263-010-0390-2.

[Bay+08] H. Bay et al. “SURF: Speeded Up Robust Features”. In: Computer Vision and Image Understanding 110.3 (2008), pp. 346–359.

[Bay07] David M. Baylon. “On the detection of temporal field order in interlaced video data”. In: Image Processing, 2007. ICIP 2007. IEEE International Conference on. Vol. 6. IEEE. 2007, pp. VI–129.

[BC86] Ted J. Broida and Rama Chellappa. “Estimation of Object Motion Parameters from Noisy Images”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8.1 (1986), pp. 90–99. ISSN: 0162-8828. DOI: 10.1109/TPAMI.1986.4767755.

[BD11] O. Barnich and M. Van Droogenbroeck. “ViBe: A Universal Background Subtraction Algorithm for Video Sequences”. In: IEEE Transactions on Image Processing 20.6 (2011), pp. 1709–1724. ISSN: 1057-7149. DOI: 10.1109/TIP.2010.2101613.

[Bes+05] Jesús Bescós et al. “A unified model for techniques on video-shot transition detection”. In: Multimedia, IEEE Transactions on 7.2 (2005), pp. 293–307.

[BFB94] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. “Performance of optical flow techniques”. In: International Journal of Computer Vision 12.1 (1994), pp. 43–77. DOI: 10.1007/BF01420984.

[BM11] T. Brox and J. Malik. “Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 33.3 (2011), pp. 500–513.

[Bou14] Thierry Bouwmans. “Traditional and recent approaches in background modeling for foreground detection: An overview”. In: Computer Science Review 11–12 (2014), pp. 31–66. ISSN: 1574-0137. DOI: 10.1016/j.cosrev.2014.04.001. URL: http://www.sciencedirect.com/science/article/pii/S1574013714000033.

[BS05] Rémi Barland and Abdelhakim Saadane. “Reference free quality metric for JPEG-2000 compressed images”. In: Signal Processing and Its Applications, 2005. Proceedings of the Eighth International Symposium on. Vol. 1. IEEE. 2005, pp. 351–354.

[BV01] Y. Boykov, O. Veksler, and R. Zabih. “Fast Approximate Energy Minimization via Graph Cuts”. In: IEEE Pattern Analysis and Machine Intelligence 23.11 (2001), pp. 1222–1239.


[Cai+14] Chenghao Cai et al. “Fast Learning of Deep Neural Networks via Singular Value Decomposition”. In: PRICAI 2014: Trends in Artificial Intelligence: 13th Pacific Rim International Conference on Artificial Intelligence, Gold Coast, QLD, Australia, December 1-5, 2014. Proceedings. Ed. by Duc-Nghia Pham and Seong-Bae Park. Cham: Springer International Publishing, 2014, pp. 820–826. ISBN: 978-3-319-13560-1. DOI: 10.1007/978-3-319-13560-1_65. URL: http://dx.doi.org/10.1007/978-3-319-13560-1_65.

[Can+11] E. J. Candes et al. “Robust Principal Component Analysis?” In: Journal of the ACM 58.3 (2011).

[CGP05] Costas Cotsaces, M. Gavrielides, and Ioannis Pitas. “A survey of recent work in video shot boundary detection”. In: Proceedings of 2005 Workshop on Audio-Visual Content and Information Visualization in Digital Libraries, available at: http://poseidon.csd.auth.gr/papers/PUBLISHED/CONFERENCE/pdf/Cotsaces05a.pdf. 2005.

[Cha+08] Yuchou Chang et al. “Unsupervised video shot detection using clustering ensemble with a color global scale-invariant feature transform descriptor”. In: Journal on Image and Video Processing 2008 (2008), p. 9.

[Cox95] G. S. Cox. Template Matching and Measures of Match in Image Processing. Tech. rep. Department of Electrical Engineering, University of Cape Town, 1995.

[CRM03] D. Comaniciu, V. Ramesh, and P. Meer. “Kernel-based object tracking”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 25.5 (2003), pp. 564–577. ISSN: 0162-8828. DOI: 10.1109/TPAMI.2003.1195991.

[CSP13] Chil-Suk Cho, Joongseok Song, and Jong-Il Park. “Glare region detection in night scene using multi-layering”. In: The Third International Conference on Digital Information Processing and Communications. The Society of Digital Information and Wireless Communication. 2013, pp. 467–469.

[DL10] Qin Dai and Ralf Lehnert. “Impact of packet loss on the perceived video quality”. In: Evolving Internet (INTERNET), 2010 Second International Conference on. IEEE. 2010, pp. 206–209.

[DT05] N. Dalal and B. Triggs. “Histograms of Oriented Gradients for Human Detection”. In: Computer Vision and Pattern Recognition. 2005.

[Eve+10] M. Everingham et al. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. Tech. rep. 2010.

[Fel+10] P. F. Felzenszwalb et al. “Object Detection with Discriminatively Trained Part Based Models”. In: IEEE Pattern Analysis and Machine Intelligence 32.9 (2010), pp. 1627–1645.

[FGS11] A. Faro, D. Giordano, and C. Spampinato. “Adaptive Background Modeling Integrated With Luminosity Sensors and Occlusion Processing for Reliable Vehicle Detection”. In: IEEE Transactions on Intelligent Transportation Systems 12.4 (2011), pp. 1398–1412. ISSN: 1524-9050. DOI: 10.1109/TITS.2011.2159266.

[For09] Guillaume Forbin. “Flicker and unsteadiness compensation for archived film sequences”. PhD thesis. University of Surrey, 2009.

[FS99] Y. Freund and R. E. Schapire. “A Short Introduction to Boosting”. In: Journal of Japanese Society for Artificial Intelligence 15.5 (1999), pp. 771–780.

[GC07] Costantino Grana and Rita Cucchiara. “Linear transition detection as a unified shot detection approach”. In: IEEE Transactions on Circuits and Systems for Video Technology 17.4 (2007), p. 483.


[Gen+06] Yuliang Geng et al. “A Robust and Hierarchical Approach for Camera Motion Classification”. In: Structural, Syntactic, and Statistical Pattern Recognition. Ed. by David Hutchison et al. Vol. 4109. Lecture Notes in Computer Science. Berlin and Heidelberg: Springer Berlin Heidelberg, 2006, pp. 340–348. ISBN: 978-3-540-37236-3. DOI: 10.1007/11815921_37.

[GFM11] R. Girshick, P. Felzenszwalb, and D. McAllester. “Object Detection with Grammar Models”. In: Proceedings of Advances in Neural Information Processing Systems (NIPS). 2011.

[Gir+14] Ross Girshick et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014.

[Gir15] Ross Girshick. “Fast R-CNN”. In: Proceedings of the International Conference on Computer Vision (ICCV). 2015.

[Gro06] Harald Grossauer. “Inpainting of movies using optical flow”. In: Mathematical Models for Registration and Applications to Medical Imaging. Springer, 2006, pp. 151–162.

[Ham09] Abdul Hameed. “Video shot detection by motion estimation and compensation”. In: Emerging Technologies, 2009. ICET 2009. International Conference on. IEEE. 2009, pp. 241–246.

[Han02] Alan Hanjalic. “Shot-boundary detection: unraveled and resolved?” In: Circuits and Systems for Video Technology, IEEE Transactions on 12.2 (2002), pp. 90–105.

[HCL04] Yu-Hsuan Ho, Wei-Ren Chen, and Chia-Wen Lin. “A rate-constrained key-frame extraction scheme for channel-aware video streaming”. In: Image Processing, 2004. ICIP '04. 2004 International Conference on. Vol. 1. IEEE. 2004, pp. 613–616.

[HL01] E. Hjelmas and B. K. Low. “Face Detection: A Survey”. In: Computer Vision and Image Understanding 83 (2001), pp. 236–274.

[HS81] Berthold K. P. Horn and Brian G. Schunck. “Determining optical flow”. In: Artificial Intelligence 17.1-3 (1981), pp. 185–203. ISSN: 0004-3702. DOI: 10.1016/0004-3702(81)90024-2.

[HS88] C. Harris and M. Stephens. “A Combined Corner and Edge Detector”. In: Alvey Vision Conference. Manchester, UK, 1988, pp. 147–151.

[HST11] Sam Hare, Amir Saffari, and Philip H. S. Torr. “Struck: Structured output tracking with kernels”. In: 2011 IEEE International Conference on Computer Vision. 2011, pp. 263–270. ISBN: 978-1-4577-1101-5. DOI: 10.1109/ICCV.2011.6126251.

[HTG09] Quan Huynh-Thu and Mohammed Ghanbari. “No-reference temporal quality metric for video impaired by frame freezing artefacts”. In: Image Processing (ICIP), 2009 16th IEEE International Conference on. IEEE. 2009, pp. 2221–2224.

[Hu+11] Weiming Hu et al. “A survey on visual content-based video indexing and retrieval”. In: Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 41.6 (2011), pp. 797–819.

[JhYs08] Huang Jing-hua and Yang Yan-song. “Effective approach to camera zoom detection based on visual attention”. In: 2008 9th International Conference on Signal Processing (ICSP 2008). 2008, pp. 985–988. ISBN: 978-1-4244-2178-7. DOI: 10.1109/ICOSP.2008.4697293.

[KB13] Hjalmar S. Kühl and Tilo Burghardt. “Animal Biometrics: Quantifying and Detecting Phenotypic Appearance”. In: Trends in Ecology & Evolution 28.7 (Mar. 2013), pp. 432–441. ISSN: 0169-5347. DOI: 10.1016/j.tree.2013.02.013.


[KC01] Irena Koprinska and Sergio Carrato. “Temporal video segmentation: A survey”. In: Signal Processing: Image Communication 16.5 (2001), pp. 477–500.

[KCM04] Jinman Kang, I. Cohen, and G. Medioni. “Object reacquisition using invariant appearance model”. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR). Cambridge, UK, 2004, 759–762 Vol. 4. DOI: 10.1109/ICPR.2004.1333883.

[KE06] C. Küblbeck and A. Ernst. “Face Detection and Tracking in Video Sequences Using the Modified Census Transformation”. In: Image and Vision Computing 24.6 (2006), pp. 564–572.

[Kim+04] Kyungnam Kim et al. “Background modeling and subtraction by codebook construction”. In: Image Processing, 2004. ICIP '04. 2004 International Conference on. Vol. 5. 2004, 3061–3064 Vol. 5. DOI: 10.1109/ICIP.2004.1421759.

[KPL01] Sune Keller, Kim S. Pedersen, and François Lauze. “Detecting Interlaced or Progressive Source of Video”. In: image 10.I2 (2001).

[LH08] Hantao Liu and Ingrid Heynderickx. “A no-reference perceptual blockiness metric”. In: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE. 2008, pp. 865–868.

[Lie01] Rainer Lienhart. “Reliable transition detection in videos: A survey and practitioner's guide”. In: International Journal of Image and Graphics 1.03 (2001), pp. 469–486.

[LK81] Bruce D. Lucas and Takeo Kanade. “An iterative image registration technique with an application to stereo vision”. In: Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc., San Francisco, 1981, pp. 674–679.

[LKH08] Hantao Liu, Nick Klomp, and Ingrid Heynderickx. “Perceptually relevant ringing region detection method”. In: Signal Processing Conference, 2008 16th European. IEEE. 2008, pp. 1–5.

[LKH10a] Hantao Liu, Nick Klomp, and Ingrid Heynderickx. “A no-reference metric for perceived ringing artifacts in images”. In: IEEE Transactions on Circuits and Systems for Video Technology 20.4 (2010), pp. 529–539.

[LKH10b] Hantao Liu, Nick Klomp, and Ingrid Heynderickx. “A perceptually relevant approach to ringing region detection”. In: IEEE Transactions on Image Processing 19.6 (2010), pp. 1414–1426.

[Lop+13] J. P. Lopez et al. “No-reference algorithms for video quality assessment based on artifact evaluation in MPEG-2 and H.264 encoding standards”. In: Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on. IEEE. 2013, pp. 1336–1339.

[Low04] D. G. Lowe. “Distinctive Image Features from Scale-Invariant Keypoints”. In: International Journal of Computer Vision 60.2 (2004), pp. 91–110.

[LTO12] Xinhao Liu, Masayuki Tanaka, and Masatoshi Okutomi. “Noise level estimation using weak textured patches of a single noisy image”. In: 2012 19th IEEE International Conference on Image Processing. IEEE. 2012, pp. 665–668.

[LZ01] Rainer Lienhart and Andre Zaccarin. “A system for reliable dissolve detection in videos”. In: Image Processing, 2001. Proceedings. 2001 International Conference on. Vol. 3. IEEE. 2001, pp. 406–409.

[Nap+98] Milind R. Naphade et al. “A high-performance shot boundary detection algorithm using multiple cues”. In: Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on. Vol. 1. IEEE. 1998, pp. 884–887.


[Ohm+03] Jens-Rainer Ohm et al. “The MPEG-7 Color Descriptors”. In: (2003).

[Par+11] Omkar M. Parkhi et al. “The Truth About Cats and Dogs”. In: International Conference on Computer Vision (ICCV). Barcelona, Spain, 2011.

[PW05] F. Porikli and C. Wren. “Change detection by frequency decomposition: Wave-back”. In: Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). 2005.

[RBK98] H. A. Rowley, S. Baluja, and T. Kanade. “Neural network-based face detection”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.1 (1998), pp. 23–38.

[Red+15] Joseph Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: CoRR abs/1506.02640 (2015). URL: http://arxiv.org/abs/1506.02640.

[Ren+15] Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. In: Neural Information Processing Systems (NIPS). 2015.

[RF03] Deva Ramanan and D. A. Forsyth. “Using Temporal Coherence to Build Models of Animals”. In: International Conference on Computer Vision (ICCV). Nice, France, 2003, pp. 1–8. URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1238364.

[RFB05] D. Ramanan, D. A. Forsyth, and K. Barnard. “Detecting, Localizing and Recovering Kinematics of Textured Animals”. In: Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. San Diego, California, USA: IEEE, 2005, pp. 635–642. ISBN: 0-7695-2372-2. DOI: 10.1109/CVPR.2005.126. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1467501.

[RFB06] Deva Ramanan, David A. Forsyth, and Kobus Barnard. “Building Models of Animals from Video”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28.8 (Aug. 2006), pp. 1319–1334. ISSN: 0162-8828. DOI: 10.1109/TPAMI.2006.155.

[RG01] Lukas Rosenthaler and Rudolf Gschwind. “Restoration of movie films by digital image processing”. In: (2001).

[RRB12] S. Roy, S. Roy, and S. K. Bandyopadhyay. “A Tutorial Review on Face Detection”. In: International Journal of Engineering Research & Technology (IJERT) 1.8 (2012).

[RS03] Zeeshan Rasheed and Mubarak Shah. “Scene detection in Hollywood movies and TV shows”. In: Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on. Vol. 2. IEEE. 2003, pp. II–343.

[SA04] Koichi Sato and J. K. Aggarwal. “Temporal spatio-velocity transform and its application to tracking and interaction”. In: Computer Vision and Image Understanding 96.2 (2004), pp. 100–128. DOI: 10.1016/j.cviu.2004.02.003.

[SA07] Evaggelos Spyrou and Yannis Avrithis. “Keyframe extraction using local visual semantics in the form of a region thesaurus”. In: Semantic Media Adaptation and Personalization, Second International Workshop on. IEEE. 2007, pp. 98–103.

[SAM13] Filippo Stanco, Dario Allegra, and Filippo Luigi Maria Milotta. “Detection and correction of mistracking in digitalized analog video”. In: International Conference on Image Analysis and Processing. Springer. 2013, pp. 218–227.


[SC05] Bogdan Smolka and Andrzej Chydzinski. “Fast detection and impulsive noise removal in color images”. In: Real-Time Imaging 11.5 (2005), pp. 389–402.

[SC97] Richard Szeliski and James Coughlan. “Spline-Based Image Registration”. In: International Journal of Computer Vision 22.3 (1997), pp. 199–218. DOI: 10.1023/A:1007996332012.

[SG10] Fangxia Shi and Xiaojun Guo. “Keyframe extraction based on kmeas results to adjacent DC images similarity”. In: Signal Processing Systems (ICSPS), 2010 2nd International Conference on. Vol. 1. IEEE. 2010, pp. V1–611.

[SG99] C. Stauffer and W. Grimson. “Adaptive Background Mixture Models for Real-Time Tracking”. In: Computer Vision and Pattern Recognition (CVPR). Fort Collins, CO, USA, 1999.

[Sha+14] Muhammad Shahid et al. “No-reference image and video quality assessment: a classification and review of recent approaches”. In: EURASIP Journal on Image and Video Processing 2014.1 (2014), p. 1.

[SL96] G. Sudhir and John C. M. Lee. “Video Annotation by Motion Interpretation Using Optical Flow Streams”. In: Journal of Visual Communication and Image Representation 7.4 (1996), pp. 354–368. ISSN: 1047-3203. DOI: 10.1006/jvci.1996.0031.

[SOD10] Alan F. Smeaton, Paul Over, and Aiden R. Doherty. “Video shot boundary detection: Seven years of TRECVid activity”. In: Computer Vision and Image Understanding 114.4 (2010), pp. 411–418.

[SOK06] Alan F. Smeaton, Paul Over, and Wessel Kraaij. “Evaluation campaigns and TRECVid”. In: MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval. Santa Barbara, California, USA: ACM Press, 2006, pp. 321–330. ISBN: 1-59593-495-2. DOI: 10.1145/1178677.1178722.

[SS99] R. E. Schapire and Y. Singer. “Improved Boosting Algorithms Using Confidence-Rated Predictions”. In: Machine Learning 37.3 (1999), pp. 297–336.

[ST94] Jianbo Shi and C. Tomasi. “Good features to track”. In: IEEE Conference on Computer Vision and Pattern Recognition. 1994, pp. 593–600. ISBN: 0-8186-5825-8. DOI: 10.1109/CVPR.1994.323794.

[STL04] Gary J. Sullivan, Pankaj N. Topiwala, and Ajay Luthra. “The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions”. In: Optical Science and Technology, the SPIE 49th Annual Meeting. International Society for Optics and Photonics. 2004, pp. 454–474.

[SVH97] M. V. Srinivasan, S. Venkatesh, and R. Hosie. “Qualitative estimation of camera motion parameters from video sequences”. In: Pattern Recognition 30.4 (1997), pp. 593–606. ISSN: 0031-3203. DOI: 10.1016/S0031-3203(96)00106-9.

[Tak+03] Shinichi Takagi et al. “Sports video categorizing method using camera motion parameters”. In: Visual Communications and Image Processing 2003. International Society for Optics and Photonics. 2003, pp. 2082–2088.

[Tes+10] Nikola Teslic et al. “Packet-loss error detection system for DTV and set-top box functional testing”. In: IEEE Transactions on Consumer Electronics 56.3 (2010), pp. 1311–1319.

[TG00] K. T. Tan and Mohammed Ghanbari. “Blockiness detection for MPEG2-coded video”. In: IEEE Signal Processing Letters 7.8 (2000), pp. 213–215.


[Tri10] George A. Triantafyllidis. “Image quality measurement in the frequency domain”. In: Communications, Control and Signal Processing (ISCCSP), 2010 4th International Symposium on. IEEE. 2010, pp. 1–4.

[TV07] Ba Tu Truong and Svetha Venkatesh. “Video abstraction: A systematic review and classification”. In: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP) 3.1 (2007), p. 3.

[TVD03] Ba Tu Truong, Svetha Venkatesh, and Chitra Dorai. “Scene extraction in motion pictures”. In: Circuits and Systems for Video Technology, IEEE Transactions on 13.1 (2003), pp. 5–15.

[TZ04] Wallapak Tavanapong and Junyu Zhou. “Shot clustering techniques for story browsing”. In: Multimedia, IEEE Transactions on 6.4 (2004), pp. 517–527.

[VJ01] P. Viola and M. Jones. “Rapid object detection using a boosted cascade of simple features”. In: Conference on Computer Vision and Pattern Recognition (CVPR). Kauai, Hawaii, USA, 2001, pp. 511–518.

[Vla00] T. Vlachos. “Detection of blocking artifacts in compressed video”. In: Electronics Letters 36.13 (2000), pp. 1106–1108.

[VRB01] C. J. Veenman, M. J. T. Reinders, and E. Backer. “Resolving motion correspondence for densely moving points”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 23.1 (2001), pp. 54–72. ISSN: 0162-8828. DOI: 10.1109/34.899946.

[WBE00] Zhou Wang, Alan C. Bovik, and B. L. Evans. “Blind measurement of blocking artifacts in images”. In: Image Processing, 2000. Proceedings. 2000 International Conference on. Vol. 3. IEEE. 2000, pp. 981–984.

[WLY13] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. “Online Object Tracking: A Benchmark”. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2013, pp. 2411–2418. ISBN: 1063-6919. DOI: 10.1109/CVPR.2013.312.

[XL98] Wei Xiong and John Chung-Mong Lee. “Efficient Scene Change Detection and Camera Motion Annotation for Video Classification”. In: Computer Vision and Image Understanding 71.2 (1998), pp. 166–181. DOI: 10.1006/cviu.1998.0711.

[YJS06] Alper Yilmaz, Omar Javed, and Mubarak Shah. “Object tracking”. In: ACM Computing Surveys 38.4 (2006), 13–es. ISSN: 0360-0300. DOI: 10.1145/1177352.1177355.

[YK02] Wei-Qi Yan and Mohan S. Kankanhalli. “Detection and Removal of Lighting & Shaking Artifacts in Home Videos”. In: Proceedings of the Tenth ACM International Conference on Multimedia. MULTIMEDIA '02. Juan-les-Pins, France: ACM, 2002, pp. 107–116. ISBN: 1-58113-620-X. DOI: 10.1145/641007.641031. URL: http://doi.acm.org/10.1145/641007.641031.

[YKA02] M. H. Yang, D. J. Kriegman, and N. Ahuja. “Detecting Faces in Images: A Survey”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 24.1 (2002), pp. 34–58.

[Zha+09] Ranran Zhang et al. “The Correction of Intensity Flicker in Archived Film”. In: Information Technology and Computer Science, 2009. ITCS 2009. International Conference on. Vol. 2. IEEE. 2009, pp. 402–405.

[ZMM99] Ramin Zabih, Justin Miller, and Kevin Mai. “A feature-based algorithm for detecting and classifying production effects”. In: Multimedia Systems 7.2 (1999), pp. 119–128.


[ZW94] R. Zabih and J. Woodfill. “Non-Parametric Local Transforms for Computing Visual Correspondence”. In: European Conference on Computer Vision (ECCV). 1994, pp. 151–158.
