

Event Driven Architecture for Distributed Surveillance Research Systems

Anonymous CVPR submission

Paper ID ****

Abstract

1. Introduction

Automated visual surveillance is drawing more and more attention in both industry and academia. Over the past years, computer vision research has matured considerably, and the number of real deployments carried out with industrial partners grows day by day. According to market analyses, the annual growth of the security industry is close to 37% [1], and the evolution of video surveillance is overwhelming [2]. A flurry of different algorithms and techniques has been proposed, addressing every level of the system, from object detection and segmentation up to behavior classification. In addition to isolated solutions, several proposals describe how to combine elementary algorithms into a complete framework. A real-world surveillance application, in fact, usually requires both integration and replication of many modules: integration to face compound problems, and replication to manage more than one video source at the same time. Architectural and development issues are therefore unavoidable in distributed camera systems.

Leading companies in surveillance have proposed their own integrated frameworks, focusing on reliability, extensibility, and scalability in addition to efficacy. For example, IBM S3 [3] is an open and extensible framework for real-time event analysis. S3 has a two-layer architecture, with a plug-in-based analytics engine at the bottom and a database engine for event management on top. New detection capabilities can be included by adding plug-ins to the low-level system (SSE, Smart Surveillance Engine), while all fusion tasks are provided at the high level (MILS, Middleware for Large Scale Surveillance). Object detection, tracking, and classification are basic SSE technologies, whereas alert and abnormal-event detection are performed at the MILS level. Similarly, ObjectVideo proposed the VEW system [2], developed starting from the prototype built under the VSAM (Video Surveillance And Monitoring) part of DARPA's IUBA program [4]. Sarnoff Corporation proposed a fully integrated system called Sentient Environment, which can detect and track multiple humans over a wide area using a network of stereo cameras. Each stereo camera works independently, providing detection and tracking capabilities, while data fusion is performed by a multi-camera tracker (McTracker) that stores the extracted information in a database for further query and analysis tasks.

Unlike these commercial frameworks, research laboratories need open platforms for testing, comparing, and composing different algorithms and techniques; often, more than one algorithm must be active at the same time to solve the same problem. Reconfigurability and flexibility matter more than reliability and scalability. An important tool for research laboratories is the OpenCV (Open Source Computer Vision) library, a set of programming functions mainly aimed at real-time computer vision. It aims at (i) advancing vision research by providing open and optimized code for basic vision infrastructure (no more reinventing the wheel), and (ii) disseminating vision knowledge by providing a common infrastructure that developers can build on, so that code becomes more readable and transferable. However, OpenCV does not contain any facility for module composition or for multiple-camera integration.

2. Guidelines for distributed surveillance systems

New computational resources and technological progress allow the development of integrated surveillance systems, in which multiple video sources are processed simultaneously and the extracted information is merged and managed together.

2.1. A three-layer tracking system

Let us consider a system with many cameras covering a wide area. On each camera (video source), a set of standard tasks should be performed, such as object detection, tracking, and classification.


Usually these tasks can be performed independently, without strong interactions among cameras, and the final aim of each is the generation and management of a set of tracks. We call a single-camera track (or just track) [5] the logical entity that represents a moving object inside the field of view of a camera. Since the surveillance system may detect, and be interested in, vehicles and animals in addition to people, hereinafter we refer to them more generally as moving objects. The track should be consistently maintained and updated for as long as the object is visible. The object's current appearance and position, as well as its history, are stored in the track state. Thus, each single-camera processing (SCP) system aims at generating and handling a list of tracks.
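As a concrete illustration, the track state described above can be sketched as a small data structure. The following is our own minimal sketch, with hypothetical type and field names that are not taken from the paper's library:

#include <deque>
#include <opencv2/core.hpp>

// Sketch of a single-camera track state (names are ours, not the library's):
// current appearance and position, plus the object's history.
struct Track {
    int id = -1;                          // identifier, unique within one camera
    cv::Mat appearance;                   // appearance model (e.g., a pixel-level template)
    cv::Point2f position;                 // current support point on the image plane
    std::deque<cv::Point2f> history;      // past positions, oldest first
    bool visible = true;                  // cleared once the object leaves the view

    void update(const cv::Mat& newAppearance, const cv::Point2f& newPosition) {
        history.push_back(position);      // archive the previous position
        appearance = newAppearance.clone();
        position = newPosition;
    }
};

In this reading, the SCP layer would create one such object per detected moving object and call update() at every frame.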

When more than one single-camera process is active at the same time, more knowledge about the scene can be obtained by combining the individual outputs. For example, instead of simply merging the outputs into a collection of tracks from different points of view, we can identify when multiple tracks refer to the same object. This task is called consistent labeling [6] and can be performed by means of geometrical and/or visual information (see [7] for a more exhaustive review). The choice of the consistent-labeling technique is tightly related to the camera setup. In particular, if the fields of view of two or more cameras overlap, geometrical constraints can be used, and strong interaction among the involved single-camera environments is required to fully exploit the time interval during which the same object crosses the overlapping zone and is simultaneously visible to all the cameras. We call multi-camera processing (MCP) the processing system devoted to controlling an overlapping set of cameras; it produces a set of multi-camera tracks (mtracks). Mtracks shall have a one-to-one correspondence with real observed objects and embed information coming from all their different views (cameras).

A weaker synchronization is required, instead, when consistent labeling is performed among non-overlapping cameras, or when we try to re-identify the same object once it comes back to the scene after a while. In such situations, the consistent-labeling (or people re-identification) task is operated by a distributed camera processing (DCP) system.

Summarizing, the overall surveillance tracking system can be seen as a set of clusters of overlapping cameras. Each node consists of a single-camera process embedding the traditional single-view stack of tasks. Inside each cluster, strong interaction among nodes guarantees consistent labeling by means of geometrical and appearance constraints. Information coming from each cluster is then merged and managed by a higher-level process. The resulting framework is a three-layer tracking system, as depicted in Fig. 1.

Figure 1. The proposed three-layer tracking system for surveillance of wide areas.

2.2. Multicamera vs. distributed: the parallel computing paradigm

We have a sort of "parallel processing" of videos, in complete analogy with a parallel computing model. Parallel computing is a form of computation in which many calculations are carried out simultaneously [8]. The memory model and the communication/synchronization schema are the main aspects considered in its classification. Main memory in a parallel computer is either shared memory (shared among all processing elements in a single address space) or distributed memory (each processing element has its own local address space). Distributed memory refers to the fact that the memory is logically distributed, but often implies that it is physically distributed as well. The communication/synchronization schema, instead, is related both to hardware issues such as the memory model and to algorithmic factors, like the size and frequency of the process interactions.

Coming back to the surveillance framework, we can apply similar considerations. Both multi- and distributed-camera systems call for a parallel processing architecture. When cameras overlap, strong interaction is required to exploit all the available information and to guarantee time synchronization during the handoff process. Thus, multicamera systems can be assimilated to a multicore computer architecture with a shared-memory model.

...

2.3. Static Service Oriented Architecture

In addition to tracking, a complete surveillance system should include several high-level modules. Face detection, face recognition, action analysis, and event detection are some examples of functionality that an automatic people-surveillance system needs to integrate. Other modules can be added depending on the application and the monitored environment. Some of these tasks should, or can, be performed on a subset of the installed cameras only, or on a particular time schedule. Moreover, the same task (e.g., face detection) can be carried out using different techniques. A wide surveillance system should be able to take all these requirements into account, supporting this plethora of services. As proposed by Detmold et al. [9], a Service-Oriented Architecture (SOA) approach [10] can be efficiently used for distributed video surveillance systems.


Similarly to [9], we have implemented each high-level functionality as an individual, isolated service, even though no standard description languages have been adopted to define the services and their orchestration; we use a static SOA approach [11, 12]. Static SOA environments are similar to traditional systems, in which "services" act much like "components". Dynamic composition is possible (at run time the set of working services can be configured), but dynamic discovery is limited to a predefined service pool. Surveillance services comprise unassociated, loosely coupled units of functionality that have no calls to each other embedded in them. The main goal is the reconfigurability of the system, leading to low marginal costs for creating the n-th application.
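A static SOA of this kind can be pictured as a predefined pool of service factories from which a run-time configuration selects the active set. The sketch below is our own illustration under assumed names (Service, ServicePool), not the paper's actual API:

#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch of a static SOA: services share one interface,
// discovery is limited to a predefined pool, composition happens at run time.
struct Service {
    virtual ~Service() = default;
    virtual void process() = 0;   // work performed on each activation
};

class ServicePool {
    std::map<std::string, std::function<std::unique_ptr<Service>()>> factories_;
public:
    // The pool is filled once at startup: no dynamic discovery beyond it.
    void register_factory(const std::string& name,
                          std::function<std::unique_ptr<Service>()> factory) {
        factories_[name] = std::move(factory);
    }
    // Dynamic composition: instantiate only the services named in the config.
    std::vector<std::unique_ptr<Service>> compose(const std::vector<std::string>& config) {
        std::vector<std::unique_ptr<Service>> active;
        for (const auto& name : config)
            if (auto it = factories_.find(name); it != factories_.end())
                active.push_back(it->second());
        return active;
    }
};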

2.4. Event-Driven communication Architecture (EDA)

The modular architecture described in Section 2.3 was developed to keep each module isolated from the others. Nevertheless, a communication architecture is required to manage the overall system. To this aim we propose an event-driven architecture (EDA) [13]. EDA is a well-known software pattern; events and their production, detection, consumption, and the reactions to them are the core of such applications. It is usually applied as a synchronization method among loosely coupled software components and services. An event-driven system typically consists of agents (event generators) and sinks (event consumers). Sinks have the responsibility of applying a reaction as soon as an event is raised by an agent.

EDA is recommended especially when:

• application stages should run even when another adjacent stage is not running;

• there is a need to frequently add, drop, or modify processing stages without affecting any previous or subsequent stage;

• multiple stages may execute simultaneously;

• two or more sinks (stages) need the same input event data.

The event-based communication paradigm is used in a wide range of applications, and it regulates the behavior and interactions of items in nature as well.

Chandy et al. [14] report definitions and expound some ideas that should be considered during the development of event-driven systems. Using their terminology, we now describe the developed surveillance framework.

Events. Our architecture follows the "sense-and-respond" schema. The set of input cameras, together with the low-level processing units, are the sensing parts of our "universe". The other processing modules usually operate as processing agents and responders at the same time, depending on their function and role. Each time a processing module detects a significant state change in the analyzed data, it generates an event. In particular, the tracking modules generate events such as NewTrackEntersTheScene, TrackExitsTheScene, TrackStateUpdate, and so on. Higher-level modules can operate as responders: they capture events from lower-level modules, perform processing tasks, and generate their own events. For example, a face detection module is a responder of the tracking system; each time it receives a NewTrackEntersTheScene event, it performs face detection on the track appearance and generates events such as NewFaceDetected. Some modules, such as the annotation modules, operate as responders only. Many events are obtained by combining other base events or by means of temporal processing; thus we can consider them Complex Events and our architecture a Complex Event Processing (CEP) system [13].
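A minimal sketch of the agent/responder mechanics is given below; the event names come from the paper, while the EventBus type and its callback signature are our own assumptions:

#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical event bus: agents publish named events, responders subscribe.
struct Event {
    std::string type;   // e.g. "NewTrackEntersTheScene"
    int track_id;       // payload kept minimal for the sketch
};

class EventBus {
    std::map<std::string, std::vector<std::function<void(const Event&)>>> sinks_;
public:
    void subscribe(const std::string& type, std::function<void(const Event&)> cb) {
        sinks_[type].push_back(std::move(cb));
    }
    void publish(const Event& e) {            // propagate to every subscribed sink
        for (auto& cb : sinks_[e.type]) cb(e);
    }
};

// Usage: a face detector responds to tracking events and raises its own.
// EventBus bus;
// bus.subscribe("NewTrackEntersTheScene", [&](const Event& e) {
//     // run face detection on the track appearance ...
//     bus.publish({"NewFaceDetected", e.track_id});
// });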

Information flow. The connections between modules are fully configurable, allowing each module to subscribe to the events of any other module. In particular, two different connection types are created. Directly from the workflow of traditional video surveillance applications, a tree structure can be built: starting from the video sources as leaves up to the global application as root, each module has a specific parent, which is automatically subscribed to its children's events. This structure follows the "has-a" component relations. In addition to the default parent, other modules can subscribe to a particular agent's events. For example, the annotation module collects events coming from different sources, including components belonging to other branches of the tree. The network of event subscriptions is depicted in Fig. 2: solid arrows represent the parental tree, while dashed ones indicate additional subscriptions.

Modes of Interaction between Components. Three modes of interaction between components have been defined: Schedule, Pull, and Push [14]. Our framework exploits all of them (a minimal sketch follows the list):

• Schedule: the normal activity of the framework is regulated by a heart-beat clock that triggers the normal processing activity of each module. For example, each SCP module performs the object detection and tracking steps, while other modules may only update internal counters; during these "normal" activities each module can detect state changes and generate events accordingly.

• Push: the flow of events follows the Push interaction, since events arise when the source module detects a state change and wants to propagate it to the responsible listeners.


Figure 2. Event flow: solid arrows represent the parental tree, while dashed ones indicate additional subscriptions.

• Pull: especially for optimization reasons, not all state changes are propagated by means of events; thus, some modules sometimes need to sample the state of other modules, applying a Pull interaction.
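The following sketch, again under our own naming assumptions, shows how the three interaction modes can coexist around the heart-beat clock:

#include <chrono>
#include <thread>
#include <vector>

struct Module {
    virtual ~Module() = default;
    virtual void tick() = 0;          // Schedule: work done on every heart-beat
    virtual int  state() const = 0;   // Pull: other modules may sample this
    // Push happens inside tick(): on a state change the module publishes an event.
};

// Hypothetical heart-beat loop regulating the whole framework.
void run(std::vector<Module*>& modules, std::chrono::milliseconds period) {
    for (;;) {
        for (auto* m : modules) m->tick();    // Schedule interaction
        std::this_thread::sleep_for(period);  // fixed heart-beat rate
    }
}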

3. Surveillance services

The proposed paradigms and techniques have been integrated in a C++ video surveillance library. In addition to base classes for image processing, motion detection, and object tracking, as well as video source management, video output, and a graphical interface, the library supports the definition of isolated services and an event-driven orchestration among them. In this section a brief description of each implemented service is given, with particular emphasis on the three-layer tracking system, which is the core service of every surveillance application.

3.1. The three tracking layers

This section gives a summary description of the three implemented tracking layers, i.e., the SCP, MCP, and DCP layers.

3.1.1 SCP: the Sakbot system

In this layer, we assume that the models of the target objects and of their motion are unknown, so as to achieve maximum application independence. In the absence of any a priori knowledge about target and environment, the most widely adopted approach for moving-object detection with a fixed camera is background subtraction [15, 16].

It is well known that background subtraction carries two problems. The first is that the model should reflect the real background as accurately as possible, to allow accurate shape detection of moving objects. The second is that the background model should immediately reflect sudden scene changes such as the start or stop of objects, so as to detect only the actual moving objects with high reactivity (the "transient background" case). If the background model is neither accurate nor reactive, background subtraction causes the detection of false objects, often referred to as "ghosts" [15, 16]. In addition, moving-object segmentation with background suppression is affected by the problem of shadows [17, 18]. Indeed, we would like moving-object detection not to classify shadows as belonging to foreground objects, since the appearance and geometrical properties of the object can be distorted and delocalized, which in turn affects many subsequent tasks. Moreover, the probability of object under-segmentation (where more than one object is detected as a single object) increases due to connectivity between different objects via shadows.

The approach adopted in our framework was defined in [19, 20] and is called Sakbot (Statistical And Knowledge-Based ObjecT detection), since it exploits statistics and knowledge of the segmented objects to improve both background modeling and moving-object detection. Sakbot integrates the shadow removal algorithm described in [21].
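For readers unfamiliar with the basic operation, a bare-bones background-subtraction step is sketched below. This is a generic running-average model for illustration only; it is not Sakbot's statistical and knowledge-based update, whose details are in [19, 20]:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Generic background-subtraction sketch (not the Sakbot algorithm):
// maintain a background model and threshold the per-pixel difference.
cv::Mat subtract_background(const cv::Mat& frame, cv::Mat& background,
                            double alpha = 0.05, double thresh = 30.0) {
    cv::Mat gray, bg8u, diff, foreground;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    if (background.empty()) gray.convertTo(background, CV_32F);
    background.convertTo(bg8u, CV_8U);
    cv::absdiff(gray, bg8u, diff);
    cv::threshold(diff, foreground, thresh, 255, cv::THRESH_BINARY);
    // Plain running-average update; Sakbot instead drives this step with
    // statistics and knowledge of the detected objects.
    cv::accumulateWeighted(gray, background, alpha);
    return foreground;   // binary mask F of candidate moving pixels
}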

After a foreground image F has been extracted, the real moving objects generating F must be identified, separated, and followed over time. To this aim, we use an appearance-based tracking algorithm, characterized by some peculiarities that make it particularly suitable for video surveillance applications. Appearance-based tracking is a well-established paradigm to predict, match, and ensure temporal coherence of detected deformable objects in video streams. These techniques are often adopted as a valid alternative to approaches based on 3D reconstruction and model matching, since they compute the visual appearance of the objects in the image plane only, without the need to define camera, world, and object models. Especially in human motion analysis applications, the exploitation of appearance models or templates is straightforward.


Templates enable knowledge not only of the location and speed of visible people, but also of their visual appearance, silhouette, or body shape at each frame.

Our tracking system focuses on the problem of dealing with different classes of occlusions in an appearance-based deterministic tracking framework, in order to follow deformable shapes and in particular human shapes. The primary goal is to have, at each frame, a pixel-level appearance model as accurate as possible, with a very reactive update process coping with frequent shape variations. At the same time, we want to deal with both dynamic and scene occlusions during their occurrence, not after, updating the appearance model selectively. A formal definition of our approach, called Appearance-Driven tracking with Occlusion Classification (AD-HOC), is reported in [22].

In Fig. 3 we collect some snapshots of the system output, showing that AD-HOC correctly manages mutual occlusions and foremost-object recognition. Even though AD-HOC is not able to recognize groups of people entering the scene at the same time while visually connected, the system embeds a split algorithm able to manage group divisions. Finally, AD-HOC is a model-free technique; thus, it can also be used to track vehicles or other objects, such as the van in the distance in Figs. 3(f) and 3(g).

Figure 3. Qualitative results on the PETS2009 dataset, views V5-V8 ((a) V5 fr. 703, (b) V6 fr. 278, (c) V7 fr. 200, (d) V7 fr. 279, (e) V8 fr. 32, (f) V8 fr. 154, (g) V8 fr. 164, (h) V8 fr. 257). The snapshots have been selected to show AD-HOC's robustness to mutual occlusions, even when the object colors are very similar to each other. AD-HOC is model free and can thus also be used to track vehicles or objects, such as the van in the distance in (f) and (g).

3.1.2 MCP: Multicamera tracking

As defined in Section 2.1, the multicamera system is devoted to controlling a cluster of overlapping cameras, performing a consistent-labeling task over the sets of single-camera tracks. We use an approach belonging to the class of geometry-based techniques. Let us suppose that the system is composed of a set C = {C1, C2, ..., Cn} of n cameras, with each camera Ci overlapping at least one other camera Cj.

Each time a new track is detected by camera Ci in the overlapping area, its support point is projected into Cj by means of the homographic transformation. The coordinates of the projected point may not correspond to the support point of an actual object. For the match, we select the object in Cj whose support point is at the minimum Euclidean distance in the 2D plane from these coordinates. The match is then reinforced by means of epipolar constraints in a Bayesian framework, as fully described in [7].
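The geometric core of this step can be sketched as follows; the Bayesian reinforcement with epipolar constraints of [7] is deliberately omitted, and the function and parameter names are our own:

#include <limits>
#include <vector>
#include <opencv2/core.hpp>

// Project the support point of a track from camera Ci into camera Cj via the
// inter-camera homography H, then pick the closest existing support point.
// Simplified sketch: the Bayesian/epipolar verification step is omitted.
int match_track(const cv::Mat& H, const cv::Point2f& support_i,
                const std::vector<cv::Point2f>& supports_j) {
    std::vector<cv::Point2f> src{support_i}, dst;
    cv::perspectiveTransform(src, dst, H);        // homographic projection into Cj
    int best = -1;
    float best_d2 = std::numeric_limits<float>::max();
    for (int k = 0; k < static_cast<int>(supports_j.size()); ++k) {
        cv::Point2f d = supports_j[k] - dst[0];
        float d2 = d.dot(d);                      // squared Euclidean distance
        if (d2 < best_d2) { best_d2 = d2; best = k; }
    }
    return best;   // index of the matched track in Cj, or -1 if Cj has none
}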

Some snapshots of the system output (in non-trivial conditions) after the consistent-labeling assignment are reported in Fig. 4.

Figure 4. Some snapshots of the system output after consistent labeling: (a) C1 at frame 783, (b) C2 at frame 783, (c) C1 at frame 1080, (d) C2 at frame 1080.

3.1.3 DCP: Distributed camera tracking

Let us suppose we have a wide space monitored by a set of non-overlapping fixed cameras (or non-overlapping multi-camera systems). If the fields of view of the cameras are close enough, it is plausible that a person exiting one camera's field of view will soon appear in the field of view of a neighboring camera. Moreover, if the illumination conditions and the camera types are not too different, a person will generate similar color histograms in the different views. To obtain a distributed consistent labeling, we compare two different mtracks to detect whether they belong to the same moving object. Let us introduce a global Compatibility Score between two tracks, i.e., a value indicating whether the two tracks can effectively be generated by the same real object. The Compatibility Score is defined in a negative way: a good Compatibility Score does not mean that the two objects are the same, but that nothing has been detected to say they are not. Temporal, geometrical, and appearance constraints are all considered. For example, we can state that two tracks are not related to the same person if they appear at the same time on different non-overlapping cameras, or if they appear in two different cameras but the time delay is not enough to cover the distance; similarly, differences detected in clothes colors, or in the real height, decrease the Compatibility Score.
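Under our own naming assumptions, the negative-evidence character of the constraints just listed could be sketched as follows; the comparison fields and weighting are purely illustrative, not the model actually used in the paper:

#include <cmath>

// Hypothetical summary of the comparison between two mtracks.
// Each test can only rule a match out (score 0) or leave it plausible.
struct MTrackComparison {
    double exit_time, enter_time;   // seconds: leave camera A, appear in camera B
    double min_travel_time;         // minimum plausible transit time between the views
    double hist_distance;           // color-histogram distance between the two tracks
    double height_diff;             // estimated real-height difference (meters)
};

double compatibility_score(const MTrackComparison& c) {
    // Temporal constraint: simultaneous visibility (negative delay) or a delay
    // too short to cover the inter-camera distance rules the match out.
    if (c.enter_time - c.exit_time < c.min_travel_time) return 0.0;
    double score = 1.0;
    // Appearance constraints only decrease the score (illustrative weighting).
    score *= std::exp(-c.hist_distance);
    score *= std::exp(-std::abs(c.height_diff));
    return score;   // a high value means nothing says the tracks differ
}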

As described in [?], we model the Compatibility Score by means of two terms: one based on color histograms and the other on the mutual camera positioning. To this aim we propose to exploit computer graphics techniques.


Interactions between different scenes and the movements of avatars are usually managed with graph models by the computer graphics and virtual reality communities. All the possible avatar positions (or rooms) are represented by nodes, and the connecting arcs refer to allowed paths; the sequence of visited nodes is called pathnodes. A weight can be associated with each arc to attach measures to it, such as the transit duration or the likelihood of being chosen with respect to other paths. We set the arc weights empirically, but a learning phase could be established to compute them automatically.

In Fig. ?? a sample test bed is depicted; three cameras are installed and the corresponding fields of view are highlighted. A person walking in the represented scene can stay inside the field of view of the same camera, move toward another one through a transition zone, or definitively exit the area. According to the allowed paths, the overall area can be divided into three region types: A. visible regions, corresponding to the cameras' fields of view; B. non-visible regions, i.e., transition areas between two or more visible regions; C. forbidden or exit regions.

In particular, we are interested in identifying the boundaries of these regions. People inside visible regions can be detected and tracked by the corresponding single-camera system; once they exit that region and enter a non-visible region, they will likely appear again in another visible region. The weight associated with each arc is considered in the Compatibility Score computation. People exiting the overall scene are no longer considered by the system, and their Compatibility Score with subsequent tracks is null (Fig. ??).
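A minimal representation of such a weighted region graph, with assumed type names, could be:

#include <map>
#include <utility>

// Hypothetical region graph: nodes are regions (visible, non-visible, exit),
// weighted arcs are the allowed transitions between them.
enum class RegionType { Visible, NonVisible, Exit };

struct RegionGraph {
    std::map<int, RegionType> nodes;                     // region id -> type
    std::map<std::pair<int, int>, double> arc_weight;    // (from, to) -> weight

    // Weight fed into the Compatibility Score; 0 means the path is not
    // allowed (e.g., any transition out of an Exit region).
    double transition_weight(int from, int to) const {
        if (nodes.at(from) == RegionType::Exit) return 0.0;
        auto it = arc_weight.find({from, to});
        return it == arc_weight.end() ? 0.0 : it->second;
    }
};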

3.2. High-level surveillance services

The tracking module is only one part of a complete surveillance system. Other items should be added to automatically extract useful information and events. For example, we have focused our activity on people surveillance, which calls for tasks such as face detection and recognition, posture and behavior classification, event detection, and so on. These high-level modules are not directly included in the layers of the tracking system, but are kept independent. A schema of the framework is reported in Fig. 5, and the main services implemented in the prototype are described in Table 1.

4. Application Prototype

5. Conclusions and future work

6. Acknowledgments

Figure 5. Schema of the proposed framework and its high-level services.

References

[1] Research and Markets, "Report on Closed Circuit TV Industry - A Market Update (2005-2008)," 2005.

[2] Niels Haering, Péter L. Venetianer, and Alan Lipton, "The evolution of video surveillance: an overview," Machine Vision and Applications, vol. 19, no. 5-6, pp. 279–290, 2008.

[3] Y.L. Tian, L.M. Brown, A. Hampapur, M. Lu, A. Senior, and C.F. Shu, "IBM Smart Surveillance System (S3): event based video surveillance system with an open and extensible framework," Machine Vision and Applications, vol. 19, no. 5-6, pp. xx–yy, October 2008.

[4] Robert Collins, Alan Lipton, Takeo Kanade, Hironobu Fujiyoshi, David Duggins, Yanghai Tsin, David Tolliver, Nobuyoshi Enomoto, and Osamu Hasegawa, "A system for video surveillance and monitoring," Tech. Rep. CMU-RI-TR-00-12, Robotics Institute, Pittsburgh, PA, May 2000.

[5] Rita Cucchiara, Costantino Grana, Massimo Piccardi, and Andrea Prati, "Detecting moving objects, ghosts and shadows in video streams," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1337–1342, Oct. 2003.

[6] S. Khan and M. Shah, "Consistent labeling of tracked objects in multiple cameras with overlapping fields of view," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1355–1360, Oct. 2003.

[7] Simone Calderara, Rita Cucchiara, and Andrea Prati, "Bayesian-competitive consistent labeling for people surveillance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 354–360, Feb. 2008.


Face detection: We integrated two different face detectors in the framework: the well-known Viola & Jones detector [23] and the face detection library by Kienzle et al. [24].

Posture classification: The frame-by-frame posture of each person can be classified by means of its visual appearance. The implemented posture classification is based on projection histograms and selects the most likely posture among Standing, Sitting, Crouching, and Laying Down [25].

Appearance-based action recognition: The action in progress is detected using features extracted from the appearance data. Walking, waving, and pointing are some examples of the considered actions. Two different modules have been developed: the first is based on Hidden Markov Models [26] and the second on action signatures [27].

Trajectory-based action recognition: People trajectories (i.e., the frame-by-frame positions of the monitored tracks) embed information about people's behavior; in particular, they can be used to detect abnormal paths, which can be related to suspicious events. A trajectory classifier has been implemented in our system following the algorithm described in [28].

Smoke detector: Smoke detection in video surveillance systems is still an open challenge for the computer vision and pattern recognition communities. It concerns the definition of robust approaches to detect, as soon as possible, the onset and fast propagation of smoke, possibly due to explosions, fires, or special environmental conditions. The color properties of each object are analyzed against a smoke reference color model to decide whether color changes in the scene are due to natural variation or not. The input image is then divided into blocks of fixed size, and each block is evaluated separately. Finally, a Bayesian approach decides whether a foreground object is smoke. More details can be found in [29].

Table 1. Implemented services.

[8] H. R. Lorin, review of "Highly Parallel Computing" by G. S. Almasi and A. Gottlieb (Benjamin-Cummings, Redwood City, CA, 1989), IBM Systems Journal, vol. 29, no. 1, pp. 165–166, 1990.

[9] Henry Detmold, Anton van den Hengel, Anthony Dick, Katrina Falkner, David S. Munro, and Ron Morrison, "Middleware for distributed video surveillance," IEEE Distributed Systems Online, vol. 9, 2008.

[10] Thomas Erl, Service-Oriented Architecture: Concepts, Technology, and Design, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2005.

[11] W.-T. Tsai, Xiao Wei, Zhibin Cao, Raymond Paul, Yinong Chen, and Jingjing Xu, "Process specification and modeling language for service-oriented software development," March 2007, pp. 181–188.

[12] M.B. Blake, "Decomposing composition: Service-oriented software engineers," IEEE Software, vol. 24, no. 6, pp. 68–77, Nov.-Dec. 2007.

[13] David C. Luckham, The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.

[14] K. Mani Chandy, Michel Charpentier, and Agostino Capponi, "Towards a theory of events," in DEBS '07: Proceedings of the 2007 Inaugural International Conference on Distributed Event-Based Systems, New York, NY, USA, 2007, pp. 180–187, ACM.

[15] I. Haritaoglu, D. Harwood, and L.S. Davis, "W4: real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809–830, Aug. 2000.

[16] C. Stauffer and W.E.L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, Aug. 2000.

[17] S.J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler, "Tracking groups of people," Computer Vision and Image Understanding, vol. 80, no. 1, pp. 42–56, Oct. 2000.

[18] A. Elgammal, D. Harwood, and L.S. Davis, "Non-parametric model for background subtraction," in Proceedings of the IEEE ICCV'99 FRAME-RATE Workshop, 1999.

[19] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting objects, shadows and ghosts in video streams by exploiting color and motion information," in Proc. of the IEEE Int'l Conference on Image Analysis and Processing, 2001, pp. 360–365.

[20] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving objects, ghosts and shadows in video streams," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1337–1342, Oct. 2003.

[21] A. Prati, I. Mikic, M.M. Trivedi, and R. Cucchiara, "Detecting moving shadows: Algorithms and evaluation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 918–923, July 2003.


[22] Roberto Vezzani and Rita Cucchiara, "AD-HOC: Appearance driven human tracking with occlusion handling," in First International Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences (THEMIS 2008), Leeds, UK, Sept. 2008.

[23] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, Dec. 2001, vol. 1, pp. 511–518.

[24] W. Kienzle, G. Bakir, M. Franz, and B. Schölkopf, "Face detection - efficient and rank deficient," Advances in Neural Information Processing Systems, vol. 17, pp. 673–680, 2005.

[25] Rita Cucchiara, Costantino Grana, Andrea Prati, and Roberto Vezzani, "Probabilistic posture classification for human behaviour analysis," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 35, no. 1, pp. 42–54, Jan. 2005.

[26] Roberto Vezzani, Massimo Piccardi, and Rita Cucchiara, "An efficient Bayesian framework for on-line action recognition," in Proceedings of the IEEE International Conference on Image Processing, Cairo, Egypt, Nov. 2009.

[27] Simone Calderara, Rita Cucchiara, and Andrea Prati, "Action signature: a novel holistic representation for action recognition," in 5th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS 2008), Santa Fe, New Mexico, Sept. 2008.

[28] Simone Calderara, Andrea Prati, and Rita Cucchiara, "Learning people trajectories using semi-directional statistics," in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS 2009), Genova, Italy, Sept. 2009.

[29] Paolo Piccinini, Simone Calderara, and Rita Cucchiara, "Reliable smoke detection system in the domains of image energy and color," in 6th International Conference on Computer Vision Systems, Vision for Cognitive Systems, 2008.
