
Motion Analysis for Event Detection and Tracking with a Mobile Omni-Directional Camera

ACM Multimedia Systems Journal, Special Issue on Video Surveillance. Accepted Jan 2004.

Tarak Gandhi
Computer Vision and Robotics Research Laboratory
University of California – San Diego
[email protected]

Mohan M. Trivedi
Computer Vision and Robotics Research Laboratory
University of California – San Diego
[email protected]

ABSTRACT

A mobile platform mounted with an Omni-Directional Vision Sensor (ODVS) can be used to monitor large areas and detect interesting events such as independently moving persons and vehicles. To avoid false alarms due to extraneous features, the image motion induced by the moving platform should be compensated. This paper describes a formulation and application of parametric ego-motion compensation for an ODVS. Omni images give a 360 degree view of the surroundings but undergo considerable image distortion. To account for these distortions, the parametric planar motion model is integrated with the transformations into omni image space. Prior knowledge of the approximate camera calibration and camera speed is integrated into the estimation process using a Bayesian approach. Iterative, coarse-to-fine, gradient-based estimation is used to correct the motion parameters for vibrations and other inaccuracies in the prior knowledge. Experiments with the camera mounted on various types of mobile platforms demonstrate successful detection of moving persons and vehicles.

Keywords

Motion detection, Optical flow, Panoramic vision, Dynamic vision, Mobile robots, Intruder detection, Surveillance

1. INTRODUCTION AND MOTIVATION

Computer vision researchers have long recognized the importance of visual surveillance applications while pursuing some of the outstanding research issues in dynamic scene analysis, motion detection, feature extraction, pattern and activity analysis, and biometric systems. Recent world events are demanding practical and robust deployment of video based solutions for a wide range of applications [6, 9, 25, 28]. Such wider acceptance of the need for the technology does not mean that these systems are indeed ready for deployment; many important and difficult research problems are still outstanding. In this paper we present a research study focused on one such challenging research problem: developing an autonomous system which can serve as a “Mobile Sentry” to perform the tasks assigned to a person posted on guard duty around the perimeter of a base. The mobile sentry with video cameras should be able to detect “interesting” events and record and report the nature and location of each event in real time for further processing by a human operator. This is an ambitious goal, and it requires resolution of several important problems from the computer vision and intelligent robotics areas. In this paper we focus on the problem of how to detect and compensate for the ego-motion of the mobile platform. One of the novel features of our research is the omnidirectional video streams we use as the input to the vision system.

When the camera is stationary, background subtraction is often used to extract moving objects [16, 32]. However, when the camera is moving, the background also undergoes ego-motion, which should be compensated. To distinguish objects of interest from extraneous features on the ground, the ground is usually approximated by a planar surface, whose ego-motion is modelled using a projective transform [10, 24] or its linearized version. Using this model, the ego-motion of the ground can be compensated in order to separate the objects with independent motion or height. This approach has been widely used for object detection from moving platforms [5, 11, 24].

Omni-Directional Vision Sensors (ODVS) or omni-cameras, which give a 360 degree field of view of the surroundings, are very useful for applications in surveillance and robot navigation [3]. Motion estimation from moving ODVS cameras has recently been a topic of great interest. Rectilinear cameras usually have a smaller field of view, due to which the focus of expansion often lies outside the image, causing motion estimation to be sensitive to the camera orientation. Also, the motion field produced by translation along the horizontal direction is similar to that due to rotation about the vertical axis. As noted by Gluckman and Nayar [14], ODVS cameras avoid both these problems due to their wide field of view; they project the image motion onto a spherical surface using Jacobians of the transformations. Vassallo et al. [34] use the spherical projection to determine the ego-motion of a moving platform in terms of the translation and rotation of the camera. Experiments are performed using robotic platforms in an indoor environment, and the ego-motion estimates are compared with those from odometry.


Shakernia et al. [29] use the concept of back-projection flow, where the image motion is projected onto a virtual curved surface in place of the spherical surface, which makes the Jacobians of the transformation simpler. Using this concept, they have adapted ego-motion algorithms designed for planar surfaces for use with ODVS sensors. Results using simulated image sequences show the basic feasibility of the approach.

In our own research, the emphasis is on robustness, efficiency, and applicability in outdoor environments encountered in surveillance and physical or base security. The main contribution of this paper is to perform detection of events such as independently moving persons and automobiles from ODVS image sequences, and to apply it to video sequences obtained from a moving platform for surveillance applications.

2. EGO-MOTION COMPENSATION FRAMEWORK FOR ODVS VIDEO

Parametric motion estimation based on image gradients, also known as the “direct method”, has been used with rectilinear cameras for planar motion estimation, obstacle detection, and motion segmentation [20, 23]. The advantage of direct methods is that they can use motion information not only from corner-like features, but also from edges, which are usually more numerous in an image. Here, the concept of the direct method is extended for use with an ODVS. This approach was also used for detecting surrounding vehicles from a moving car in [18]. An ODVS gives a full 360 degree view of the surroundings, which reduces the motion ambiguities often present with rectilinear cameras. However, the images undergo considerable distortion, which should be accounted for during motion estimation.

The block diagram of the event detection system is shown in Figure 1. The initial estimates of the ground plane motion parameters are obtained using approximate knowledge about the camera calibration and speed. Using these parameters, one of the frames is warped towards another to compensate the motion of the ground plane. However, the motion of features having independent motion or height above the ground plane is not fully compensated. To detect these features, the normalized image difference between the two images is computed using temporal and spatial gradients. This suppresses the features on the ground plane and enhances the interesting objects. Morphological and other post-processing is performed to further suppress the ground features due to any residual motion and to get the positions of the objects. The detected objects are then tracked over frames to form events.

However, the calibration and speed of the camera may not be known accurately. Furthermore, the camera could vibrate during the motion. Due to this, the motion of the ground plane may not be fully compensated, leading to misses and false alarms. In order to improve the performance, the motion parameters are iteratively corrected using the spatial and temporal gradients of the motion compensated images via the optical flow constraint in a coarse-to-fine framework. The motion information contained in these gradients is optimally combined with the prior knowledge of the motion parameters using a Bayesian framework similar to [24]. Robust estimation is used to separate the ground plane features from other features.

Figure 1: Event detection and recording system based on ego-motion compensation from a moving platform: block diagram.

The following sections deal with the individual blocks described above, along with the appropriate formulation for the ODVS.

3. ODVS MOTION TRANSFORMATIONS

To compensate the motion of the omni-directional camera, the transformation due to the ODVS should be combined with that due to the motion. These transforms are discussed below.

3.1 Flat plane transformation

The ODVS used in this work consists of a hyperbolic mirror and a camera placed on its axis. It belongs to a class of cameras known as central panoramic catadioptric cameras [3]. These cameras have a single viewpoint that permits the image to be suitably transformed to obtain perspective views. Figure 2 (a) shows a photograph of an ODVS mirror. An image from a camera mounted with an ODVS mirror is shown in Figure 2 (b). It is seen that the camera covers a 360 degree field of view around its center. However, the image it produces is distorted, with straight lines transformed into curves. A flat plane transformation is applied to the image to produce a perspective view looking downwards, as shown in Figure 2 (c), where the distortion is considerably reduced. Details of this transformation are discussed below.



Figure 2: (a) An ODVS mirror. (b) A typical image from the ODVS. (c) Transformation to a perspective plan view.

Figure 3: Omni-directional camera geometry

The geometry of a hyperbolic ODVS is shown in Figure 3. According to the mirror geometry, the light ray from the object towards the viewpoint at the first focus O is reflected so that it passes through the second focus, where a conventional rectilinear camera is placed. The equation of the hyperboloid is given by:

$$\frac{(Z-c)^2}{a^2} - \frac{X^2+Y^2}{b^2} = 1 \qquad (1)$$

where $c = \sqrt{a^2+b^2}$.

Let P = (X, Y, Z)^t denote the homogeneous coordinates of the perspective transform of any 3-D point λP on the ray OP, where λ is the scale factor depending on the distance of the 3-D point from the origin. It can be shown [1, 19, 29] that the reflection in the mirror gives the point −p = (−x, −y)^t on the image plane of the camera using the flat-plane transform F:

$$F(P) = p = \begin{pmatrix} x \\ y \end{pmatrix} = \frac{q_1}{q_2 Z + q_3\|P\|}\begin{pmatrix} X \\ Y \end{pmatrix} \qquad (2)$$

where

$$q_1 = c^2 - a^2, \qquad q_2 = c^2 + a^2, \qquad q_3 = 2ac \qquad (3)$$

$$\|P\| = \sqrt{X^2 + Y^2 + Z^2} \qquad (4)$$

Note that the expression for the image coordinates p is independent of the scale factor λ. The pixel coordinates w = (u, v)^t are then obtained by using the calibration matrix K of the conventional camera, composed of the focal lengths f_u, f_v, the optical center coordinates (u_0, v_0)^t, and the camera skew s:

$$\begin{pmatrix} w \\ 1 \end{pmatrix} = K \begin{pmatrix} p \\ 1 \end{pmatrix} \qquad (5)$$

or

$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_u & s & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (6)$$

This transform can be used to warp an omni image to a plan perspective view. To convert a perspective view back to an omni view, the inverse flat-plane transform F^{-1} can be used:

$$\begin{pmatrix} p \\ 1 \end{pmatrix} = \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = K^{-1}\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \qquad (7)$$

$$F^{-1}(p) = P = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} q_1 x \\ q_1 y \\ q_2 - q_3\sqrt{x^2 + y^2 + 1} \end{pmatrix} \qquad (8)$$

It should be noted that the transformation from the omni to the perspective view involves very different magnifications in different parts of the image. Due to this, the quality of the image deteriorates if the entire image is transformed at once. Hence, it is desirable to perform motion estimation directly in the ODVS domain, but use the above transformations to map the locations to the perspective domain as required.

3.2 Planar motion transformation

To detect objects with motion or height, the motion of the ground is modeled using a planar motion model [10, 22]. Let P_a and P_b denote the perspective transforms of a point on the ground plane in the homogeneous coordinate systems corresponding to two positions A and B of the moving camera. These are related by:

$$\lambda_b P_b = \lambda_a R P_a + D = \lambda_a\left[R P_a + D/\lambda_a\right] \qquad (9)$$

where R and D denote the rotation and translation between the camera positions, and λ_a, λ_b depend on the distance of the actual 3-D point. Let the ground plane satisfy the following equation at camera position A:

$$\lambda_a K^t P_a = 1 \qquad (10)$$

or

$$1/\lambda_a = K^t P_a \qquad (11)$$

Substituting the value of 1/λ_a from equation (11) into equation (9), it is seen that P_a and P_b are related by a projective transform:

$$\lambda_b P_b = \lambda_a\left[R + D K^t\right] P_a = \lambda_a H P_a \qquad (12)$$


Figure 4: Transforming a pixel from omni image A to omni image B using (1) the inverse calibration matrix K^{-1}, (2) the inverse flat plane transform F^{-1}, (3) the projective transform H for planar motion from A to B, (4) the flat plane transform F, and (5) the calibration matrix K.

or P_b ≡ H P_a within a scale factor. This relation has been widely used to estimate planar motion for perspective cameras.

For performing motion compensation using omni-directional cameras, the above projective transform should be combined with the flat-plane transform as well as the camera calibration matrix to warp every point in one image towards the other. The complete transform for warping is shown in Figure 4.
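To make the warping chain of Figure 4 concrete, the sketch below composes the five steps for a single pixel using equations (2)-(8) and (12). It is a minimal NumPy illustration, not the authors' implementation; the mirror parameters, calibration matrix, and function names are assumed values chosen only so the example runs.

```python
# Minimal sketch of the Figure 4 warping chain for one pixel, assuming a
# hyperbolic ODVS with mirror parameters a, b, so that q1, q2, q3 follow
# equation (3). All numeric values below are illustrative.
import numpy as np

def flat_plane(P, q1, q2, q3):
    """Flat-plane transform F of equation (2): mirror coordinates -> image point."""
    X, Y, Z = P
    norm_P = np.sqrt(X ** 2 + Y ** 2 + Z ** 2)          # equation (4)
    s = q1 / (q2 * Z + q3 * norm_P)
    return np.array([s * X, s * Y])

def inverse_flat_plane(p, q1, q2, q3):
    """Inverse flat-plane transform F^{-1} of equation (8)."""
    x, y = p
    return np.array([q1 * x,
                     q1 * y,
                     q2 - q3 * np.sqrt(x ** 2 + y ** 2 + 1.0)])

def warp_omni_pixel(w_a, K, H, q1, q2, q3):
    """Map a pixel from omni image A to omni image B (steps 1-5 of Figure 4)."""
    p_a = (np.linalg.inv(K) @ np.array([w_a[0], w_a[1], 1.0]))[:2]  # (1) K^{-1}
    P_a = inverse_flat_plane(p_a, q1, q2, q3)                       # (2) F^{-1}
    P_b = H @ P_a                                                   # (3) H, eq. (12)
    p_b = flat_plane(P_b, q1, q2, q3)                               # (4) F
    w_b = K @ np.array([p_b[0], p_b[1], 1.0])                       # (5) K
    return w_b[:2]

if __name__ == "__main__":
    a, b = 0.03, 0.02                                          # illustrative mirror shape
    c = np.sqrt(a ** 2 + b ** 2)
    q1, q2, q3 = c ** 2 - a ** 2, c ** 2 + a ** 2, 2 * a * c   # equation (3)
    K = np.array([[400.0, 0.0, 320.0],
                  [0.0, 400.0, 240.0],
                  [0.0, 0.0, 1.0]])                            # illustrative calibration
    H = np.eye(3)                                              # identity motion
    print(warp_omni_pixel(np.array([350.0, 260.0]), K, H, q1, q2, q3))
    # With H = identity the pixel maps back to itself, a quick sanity check.
```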

4. PARAMETRIC MOTION ESTIMATION FOR ODVS

This section describes the main contribution of the paper. Direct methods based on image gradients have been applied for estimating the motion parameters for rectilinear cameras [4, 20]. Here, the direct method is generalized to ODVS cameras. Information from image gradients is combined with the a-priori known information about the camera motion and calibration in a Bayesian framework to obtain optimal estimates of the motion parameters for ego-motion compensation.

4.1 Use of optical flow constraint

Under favorable conditions, the spatial gradients (g_u, g_v), the temporal gradient g_t, and the residual image motion (Δu, Δv)^t after the current motion compensation satisfy the optical flow constraint [17]:

$$g_u\,\Delta u + g_v\,\Delta v + g_t = 0 \qquad (13)$$

However, this gives only one equation in two unknowns for each point. Due to this, only the normal flow, i.e., the flow in the direction of the gradient, can be determined using a single point. This is known as the aperture problem, illustrated in Figure 5 (a).


Figure 5: Aperture problem. (a) In the case of an edge, only the component of motion normal to the edge can be determined. (b) In the case of a corner, the aperture problem is avoided, and the motion can be uniquely determined.

To solve this problem, Lucas and Kanade [26] assumed that the image motion is approximately constant in a small window around every point. Using this constraint, more equations are obtained from the neighboring points, and the full optical flow can be estimated using least squares. Such an estimate is reliable near corner-like points, where the window has gradients in different directions, as seen in Figure 5 (b). This method has been used by Kanade, Lucas, and Tomasi [30] to find and track corner-like features over an image. However, in the case of an ODVS, the assumption of uniform optical flow needs to be modified due to the non-linear ODVS transform. Daniilidis et al. [7] have generalized optical flow estimation to ODVS cameras.

However, this approach uses the motion information only at corner-like features, whereas edge features also carry motion information. To use this information, the image gradients can be used directly to estimate the model parameters. This approach is known in the literature as the direct method of motion estimation and has been extensively used for obstacle detection with rectilinear cameras [4, 20]. Usually, a linearized version of the projective transform is used:

$$\Delta u = a_1 u + a_2 v + a_3 + a_7 u^2 + a_8 uv$$
$$\Delta v = a_4 u + a_5 v + a_6 + a_7 uv + a_8 v^2 \qquad (14)$$

The expressions for the image motion are substituted into the optical flow constraint in equation (13) to give:

$$g_u(a_1 u + a_2 v + a_3 + \ldots) + g_v(a_4 u + a_5 v + a_6 + \ldots) + g_t = 0 \qquad (15)$$

This gives one equation in the 8 parameters for every point, which can be solved using linear least squares. Since the quadratic parameters are more sensitive to noise, a 6-parameter affine model is also used.
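As an illustration of this direct least-squares step, the sketch below assembles equation (15) into a linear system and solves it for the eight parameters. The function name and the assumption that the gradients and coordinates arrive as flat NumPy arrays are ours, not taken from the paper.

```python
# A sketch of the direct least-squares solve of equation (15), assuming the
# pixel coordinates (u, v) and the gradients (gu, gv, gt) are flat NumPy
# arrays over the region of interest. The function name is illustrative.
import numpy as np

def estimate_quadratic_motion(u, v, gu, gv, gt):
    """Solve g_u*du + g_v*dv + g_t = 0 for the 8-parameter model of eq. (14)."""
    A = np.column_stack([
        gu * u, gu * v, gu,                  # a1, a2, a3
        gv * u, gv * v, gv,                  # a4, a5, a6
        gu * u ** 2 + gv * u * v,            # a7
        gu * u * v + gv * v ** 2,            # a8
    ])
    b = -gt
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params                            # (a1, ..., a8)

# Dropping the last two columns (a7, a8) gives the 6-parameter affine model
# mentioned above.
```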

4.2 Generalization for ODVS

To apply the motion estimation to an ODVS camera, the non-linear flat-plane transform is used to go from the omni to the perspective domain and back. Since the non-linearity has to be dealt with anyway, the projective transform H is used instead of a linear model so that large motions can be handled better. The motion parameters in the projective transform


are parameterized as:

$$h = \begin{pmatrix} h_1 & h_2 & h_3 & h_4 & h_5 & h_6 & h_7 & h_8 \end{pmatrix}^t \qquad (16)$$

with

$$\frac{H}{H_{33}} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{pmatrix} \qquad (17)$$

The optical flow constraint equation is satisfied only for small image displacements of up to 1 or 2 pixels. To estimate larger motions, a coarse-to-fine pyramidal framework [21, 31] is used. In this framework, a multi-resolution Gaussian pyramid is constructed for adjacent images in the sequence. The motion parameters are first computed at the coarsest level, and the image points at the next finer level are warped using the computed motion parameters. The residual motion is computed at the finer level, and the process is repeated until the finest level. Even within each level, multiple iterations of warping and estimation can be performed.
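The following sketch illustrates one way such a pyramid and the per-level parameter handling could be set up. It is not the paper's implementation: block averaging stands in for Gaussian smoothing, the homography is assumed to be kept in mirror coordinates so that only the calibration matrix is rescaled across levels, and the estimation step is passed in as a placeholder callable.

```python
# A sketch of the coarse-to-fine set-up described above, assuming grayscale
# images as NumPy float arrays. Block averaging stands in for the Gaussian
# pyramid, and the (assumed) convention is that the planar homography is kept
# in mirror coordinates, so only the calibration matrix K is rescaled per
# level (pixel coordinates at level l are roughly the original divided by 2**l).
import numpy as np

def downsample(img):
    """Halve the resolution by 2x2 block averaging (Gaussian-pyramid stand-in)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def build_pyramid(img, levels):
    """Multi-resolution pyramid; pyr[0] is the finest level, pyr[-1] the coarsest."""
    pyr = [np.asarray(img, dtype=np.float64)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

def calibration_at_level(K, level):
    """Approximate calibration matrix at a given pyramid level (0 = finest)."""
    scale = np.diag([0.5 ** level, 0.5 ** level, 1.0])
    return scale @ K

def coarse_to_fine(pyr_a, pyr_b, K, estimate_increment, h_init):
    """Refine the motion parameters from the coarsest to the finest level.

    `estimate_increment` is a placeholder callable (image_a, image_b, K_level, h)
    -> refined h, standing in for the warp-and-estimate step described above.
    """
    h = h_init
    for level in reversed(range(len(pyr_a))):
        h = estimate_increment(pyr_a[level], pyr_b[level],
                               calibration_at_level(K, level), h)
    return h
```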

Let h be the actual value of the motion parameter vector, and $\hat{h}$ be the current estimate. Using the current estimate, the second image B is warped towards the first image A to get the warped image B′. The transformation between A and B′ can then be expressed approximately in terms of $\Delta h = h - \hat{h}$. Let w_a = (u_a, v_a)^t be the projection of a point on the planar surface in image A. Then, the projection w_{b′} of the same point in the warped image B′ is a function of w_a as well as Δh, given by the composition of operations shown in Figure 4. The optical flow constraint between images A and B′ is then given by:

$$\begin{pmatrix} g_u & g_v \end{pmatrix}\left[w_{b'} - w_a\right] = -g_t + \eta \qquad (18)$$

where η accounts for the random noise in the temporal image gradient. For N points on the planar surface, the constraints can be expressed in matrix form:

$$\Delta z = c(\Delta h) + v \qquad (19)$$

where every row i of the equation represents the constraint for a single image point, with

$$c_i(\Delta h) = \begin{pmatrix} g_u & g_v \end{pmatrix}_i\left[w_{b'}(w_a; \Delta h) - w_a\right] \qquad (20)$$

$$\Delta z_i = -(g_t)_i, \qquad v_i = \eta_i$$

Due to the flat-plane and projective transforms, the function c(·) is non-linear. Hence, the state estimate $\hat{h}$ and its covariance P are iteratively updated using the measurement update equations of the iterated extended Kalman filter [2], with C denoting the Jacobian matrix of c(·):

$$P \leftarrow \left[\gamma\, C^t R^{-1} C + P_-^{-1}\right]^{-1} \qquad (21)$$

$$\hat{h} \leftarrow \hat{h} + \Delta h = \hat{h} + P\left[\gamma\, C^t R^{-1}\Delta z - P_-^{-1}\left(\hat{h} - h_-\right)\right] \qquad (22)$$

where R is the covariance of the temporal gradient measurements, h_− is the prior value of the state obtained from the camera calibration and velocity, and P_− is the prior covariance. The matrix R is taken as a diagonal matrix to simplify the calculations. However, this amounts to assuming that the pixel gradients are independent, which may not really be the case, since the gradients are computed from multiple pixels. Hence, a factor γ ≤ 1 is used to accommodate the interdependence of the gradient measurements.
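Equations (21) and (22) translate almost directly into code. The sketch below is a plain NumPy transcription under the assumption that C, Δz, R, and the priors are already assembled; the function name and the explicit matrix inversions are illustrative choices rather than the authors' implementation.

```python
# A plain NumPy transcription of the measurement-update equations (21)-(22),
# assuming the Jacobian C, residual vector dz, gradient-noise covariance R,
# prior (h_prior, P_prior), and current estimate h are already assembled.
import numpy as np

def iekf_measurement_update(h, C, dz, R, h_prior, P_prior, gamma=1.0):
    """One iterated-EKF-style update of the motion parameter vector."""
    R_inv = np.linalg.inv(R)
    P_prior_inv = np.linalg.inv(P_prior)
    # Equation (21): posterior covariance of the parameters
    P = np.linalg.inv(gamma * C.T @ R_inv @ C + P_prior_inv)
    # Equation (22): correction that blends the gradient measurements with the prior
    dh = P @ (gamma * C.T @ R_inv @ dz - P_prior_inv @ (h - h_prior))
    return h + dh, P
```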

To compute the Jacobian C, each row C_i is expressed using the chain rule:

$$C_i = \left(\frac{\partial c_i}{\partial h}\right) = \left(\frac{\partial c}{\partial w_b}\,\frac{\partial w_b}{\partial p_b}\,\frac{\partial p_b}{\partial P_b}\,\frac{\partial P_b}{\partial h}\right)_i \qquad (23)$$

where P_b = (X_b, Y_b, Z_b)^t, p_b = (x_b, y_b)^t, and w_b = (u_b, v_b)^t are the coordinates of point i in the mirror, image, and pixel coordinate systems for camera position B.

Differentiating equation (20) w.r.t. w_b gives:

$$\left(\frac{\partial c}{\partial w_b}\right)_i = \begin{pmatrix} g_u & g_v \end{pmatrix}_i \qquad (24)$$

The calibration equation (6) can be differentiated to obtain:

$$\left(\frac{\partial w_b}{\partial p_b}\right)_i = \begin{pmatrix} f_u & s \\ 0 & f_v \end{pmatrix} \qquad (25)$$

The Jacobian of the flat-plane transform is obtained by differentiating equation (2) at P = P_b:

$$\left(\frac{\partial p_b}{\partial P_b}\right)_i = \begin{pmatrix} \dfrac{\partial x_b}{\partial X_b} & \dfrac{\partial x_b}{\partial Y_b} & \dfrac{\partial x_b}{\partial Z_b} \\[1ex] \dfrac{\partial y_b}{\partial X_b} & \dfrac{\partial y_b}{\partial Y_b} & \dfrac{\partial y_b}{\partial Z_b} \end{pmatrix} \qquad (26)$$

$$= \frac{1}{\left(q_2 Z_b + q_3\|P_b\|\right)_i\,\|P_b\|_i}\begin{pmatrix} q_3 x_b X_b - q_1\|P_b\| & q_3 x_b Y_b & q_3 x_b Z_b \\ q_3 y_b X_b & q_3 y_b Y_b - q_1\|P_b\| & q_3 y_b Z_b \end{pmatrix}_i$$

Since the ODVS transforms giving p_a and p_b do not change if the homogeneous coordinates P_a and P_b are changed by a scale factor, we can scale the right-hand side of equation (12) to give:

$$P_b = \frac{1}{H_{33}} H P_a = \begin{pmatrix} h_1 X_a + h_2 Y_a + h_3 Z_a \\ h_4 X_a + h_5 Y_a + h_6 Z_a \\ h_7 X_a + h_8 Y_a + Z_a \end{pmatrix} \qquad (27)$$

Taking the Jacobian w.r.t. h = (h_1 ... h_8) gives:

$$\left(\frac{\partial P_b}{\partial h}\right)_i = \begin{pmatrix} X_a & Y_a & Z_a & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & X_a & Y_a & Z_a & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & X_a & Y_a \end{pmatrix}_i \qquad (28)$$
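Because the chain-rule Jacobian above involves several composed transforms, a central-difference approximation is a convenient cross-check during implementation. The generic helper below is not part of the paper's method; `c_func` is a placeholder standing in for the constraint function c(Δh) of equation (20).

```python
# A generic central-difference Jacobian, useful for cross-checking the analytic
# chain-rule Jacobian of equations (23)-(28). Purely illustrative.
import numpy as np

def numerical_jacobian(c_func, dh, eps=1e-6):
    """Central-difference approximation of the Jacobian of c_func at dh."""
    dh = np.asarray(dh, dtype=np.float64)
    c0 = np.atleast_1d(c_func(dh))
    J = np.zeros((c0.size, dh.size))
    for j in range(dh.size):
        step = np.zeros_like(dh)
        step[j] = eps
        J[:, j] = (np.atleast_1d(c_func(dh + step)) -
                   np.atleast_1d(c_func(dh - step))) / (2.0 * eps)
    return J

if __name__ == "__main__":
    # Sanity check on a linear function, whose Jacobian is the matrix itself.
    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    print(numerical_jacobian(lambda x: A @ x, np.zeros(2)))
```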

4.3 Outlier removal

The estimate given above is optimal only when all points really belong to the planar surface and the underlying noise distributions are Gaussian. However, the estimation is highly sensitive to the presence of outliers, i.e., points not satisfying the ground motion model. These features should be separated using a robust method. To reduce the number of outliers, the region of interest corresponding to the road is determined using calibration information, and the processing is done only in that region to avoid extraneous features. To detect outliers, an approach similar to the data snooping approach discussed in [8] has been adapted for Bayesian estimation. In this approach, the error residual of each feature is compared with the expected residual covariance at every iteration, and the features are reclassified as inliers or outliers.

If a point z_i is not included in the estimation of h, i.e., it is currently classified as an outlier, then the covariance of its residual is:

$$\mathrm{Cov}\left[\Delta z_i - C_i\Delta h\right] \simeq R + C_i P C_i^t \qquad (29)$$


Figure 6: Hierarchical motion estimation algorithm.

However, if z_i is included in the estimation of h, i.e., it is currently classified as an inlier, then it can be shown that the covariance of its residual is given by:

$$\mathrm{Cov}\left[\Delta z_i - C_i\Delta h\right] \simeq R - C_i P C_i^t < R \qquad (30)$$

Hence, to classify points for the next iteration, the Mahalanobis norm of the residual is compared with a threshold τ. For a point currently classified as an outlier, the following condition is used:

$$\left[\Delta z_i - C_i\Delta h\right]^t\left[R + C_i P C_i^t\right]^{-1}\left[\Delta z_i - C_i\Delta h\right] < \tau \qquad (31)$$

For a point currently classified as an inlier, the covariance R is used in practice instead of R − C_i P C_i^t in order to avoid a non-positive definite covariance caused by approximations due to the non-linearities. This somewhat increases the probability of classifying a point as an outlier instead of an inlier, which errs on the safe side:

$$\left[\Delta z_i - C_i\Delta h\right]^t R^{-1}\left[\Delta z_i - C_i\Delta h\right] < \tau \qquad (32)$$

Note that this method is effective only when there is some prior knowledge about the motion parameters; otherwise, the prior covariance P_− would become infinite. If there is no prior knowledge, robust estimators can be used as in [27]. The motion parameter estimation algorithm is shown in Figure 6.
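A possible vectorized form of this reclassification step is sketched below, under the assumption that each point contributes a single scalar constraint and that R is diagonal; the function and argument names are illustrative, not taken from the paper.

```python
# A sketch of the residual-based inlier/outlier reclassification of equations
# (29)-(32), assuming per-point scalar constraints so that R_i and C_i P C_i^t
# are scalars for each point.
import numpy as np

def reclassify_points(dz, C, dh, P, r_diag, is_inlier, tau):
    """Return an updated boolean inlier mask.

    dz        : (N,) residual measurements (-g_t per point)
    C         : (N, 8) Jacobian rows C_i
    dh        : (8,) current parameter correction
    P         : (8, 8) posterior covariance of the parameters
    r_diag    : (N,) diagonal of the measurement covariance R
    is_inlier : (N,) current classification
    tau       : threshold on the squared Mahalanobis norm
    """
    residual = dz - C @ dh                              # dz_i - C_i dh
    cpct = np.einsum('ij,jk,ik->i', C, P, C)            # C_i P C_i^t per point
    # Currently-outlier points: covariance R + C_i P C_i^t, equation (31).
    # Currently-inlier points: R is used in place of R - C_i P C_i^t, equation (32).
    cov = np.where(is_inlier, r_diag, r_diag + cpct)
    mahal = residual ** 2 / cov
    return mahal < tau
```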

5. DYNAMIC EVENT DETECTION AND TRACKING

After motion compensation, the features on the ground plane would be aligned between the two frames, whereas those due to stationary and moving objects would be misaligned. The image difference between the frames would therefore enhance the objects and suppress the road features. However, the image difference depends on the residual motion as well as the spatial gradients at each point. In highly textured regions, the image difference would be large even for small residual motion, and in less textured regions, the image difference would be small even for large residual motion. To compensate for this effect, the normalized frame difference [33] is used. This is given at each pixel by:

$$\frac{\sum g_t\sqrt{g_u^2 + g_v^2}}{k + \sum\left(g_u^2 + g_v^2\right)} \qquad (33)$$

where g_u, g_v are the spatial gradients and g_t is the temporal gradient. The constant k is used to suppress the effect of noise in highly uniform regions. The summation is performed over a K × K neighborhood of each pixel. In fact, the normalized difference is a smoothed version of the normal optical flow, and hence depends on the amount of motion near the point. Blobs corresponding to object features are obtained using morphological operations.
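A compact way to compute equation (33) on motion-compensated frames is sketched below; the use of scipy.ndimage.uniform_filter for the box sums, the window size, and the value of k are our illustrative choices, not the system's actual settings.

```python
# A compact implementation of the normalized frame difference of equation (33),
# assuming two motion-compensated grayscale frames as float NumPy arrays.
# uniform_filter averages over the K x K window, so the constant k is divided
# by the window area to keep the ratio equivalent to the sum form.
import numpy as np
from scipy.ndimage import uniform_filter

def normalized_frame_difference(frame_a, frame_b_warped, ksize=5, k=25.0):
    """Return the normalized difference image of equation (33)."""
    gt = frame_b_warped - frame_a            # temporal gradient
    gv, gu = np.gradient(frame_a)            # spatial gradients (row, column)
    grad_sq = gu ** 2 + gv ** 2
    num = uniform_filter(gt * np.sqrt(grad_sq), size=ksize)
    den = k / ksize ** 2 + uniform_filter(grad_sq, size=ksize)
    return num / den
```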

Nearby blobs are clustered into one, and the cluster centroids are tracked from frame to frame by an algorithm similar to [13]. For tracking, a list containing the frame number, unique ID, position, and velocity of each track is maintained. The list is empty in the beginning. The following steps are performed to associate the tracks with features:

• At each frame, associate each existing track with the nearest feature in a neighborhood window around the track position. Use a Kalman filter [2] to update the track with the feature. If no feature is found in the neighborhood window, only the time update is performed.

• For features not having a track in their neighborhood, create a new track from the feature and update it in the next frame.

• To keep the number of tracks within bounds, delete the weakest tracks when the number of tracks gets too large.

• Merge tracks that are very close to each other and have nearly the same velocity, and are therefore assumed to be from the same object.

• Display tracks which have been sustained for a stipulated number of frames, along with the track history.

For each track that survives over a minimum number of frames, the original ODVS image is used to generate a perspective view [19] of the event around the center of the bounding box.
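The sketch below shows a heavily simplified version of the association loop listed above. A constant-velocity alpha-beta update stands in for the Kalman filter of [2], and the gating radius, gains, and class names are illustrative assumptions rather than the system's actual parameters.

```python
# Simplified track association for blob centroids given as (x, y) pairs per frame.
import numpy as np

class Track:
    """Minimal track record: position, velocity, age, and missed-frame count."""
    _next_id = 0

    def __init__(self, pos):
        self.id = Track._next_id
        Track._next_id += 1
        self.pos = np.asarray(pos, dtype=float)
        self.vel = np.zeros(2)
        self.age = 1
        self.missed = 0

def update_tracks(tracks, detections, gate=30.0, alpha=0.5, beta=0.2, max_missed=5):
    """Associate blob centroids with tracks for one frame and return the survivors."""
    detections = [np.asarray(d, dtype=float) for d in detections]
    used = set()
    for t in tracks:
        pred = t.pos + t.vel                       # time update (prediction)
        best, best_dist = None, gate               # gate on the association distance
        for j, d in enumerate(detections):
            if j in used:
                continue
            dist = np.linalg.norm(d - pred)
            if dist < best_dist:
                best, best_dist = j, dist
        if best is None:
            t.pos, t.missed = pred, t.missed + 1   # no feature found: predict only
        else:
            used.add(best)
            innovation = detections[best] - pred
            t.pos = pred + alpha * innovation      # measurement update
            t.vel = t.vel + beta * innovation
            t.age, t.missed = t.age + 1, 0
    for j, d in enumerate(detections):             # unmatched features start new tracks
        if j not in used:
            tracks.append(Track(d))
    return [t for t in tracks if t.missed <= max_missed]   # prune weak tracks
```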

6. EXPERIMENTAL VALIDATION AND RESULTS

A series of experimental trials were conducted to systematically evaluate the capabilities and performance of the mobile sentry system for event detection and tracking.


Figure 7: An ODVS camera mounted on an electric cart for a mobile sentry experimental run.

Three different types of ODVS mountings were utilized to examine the generality and functionality of the mobile sentry system. The first trial involved a camera on an electric vehicle, the second used a mobile robot, and the third involved a walking person with a helmet-mounted ODVS. A related application applying a similar approach to an automobile-mounted ODVS is also shown.

6.1 Camera on Electric Cart

The first experimental trial of the mobile sentry utilized the ODVS camera mounted on an electric cart, as shown in Figure 7. The cart was driven on a campus road at speeds between 2 and 7 miles/hour (approximately 1 to 3 m/s). The approximate speed of the cart was determined using GPS and used as the a-priori motion estimate. However, there were considerable vibrations of the camera. Using the parametric motion estimation process, the effect of these vibrations was suppressed. It was also observed that the ellipse corresponding to the entire FOV of the ODVS was oscillating, possibly due to relative vibrations between the camera and the mirror, or due to the automatic motion stabilization in the camera. These oscillations were suppressed by estimating the center of the FOV ellipse using a Hough transform and translating it to a fixed position.

Figure 8 (a) shows an image from the ODVS video sequence. The estimated parametric motion is shown using red arrows. Figure 8 (b) shows the classification of points into inlier (gray), outlier (white), and unused (black) points. It should be noted that outliers are usually identified when the edges are perpendicular to the motion. When an edge is parallel to the motion, the aperture problem makes it difficult to identify it. The estimation is done using only the inlier points. The normalized frame difference between the motion compensated frames is shown in Figure 8 (c). It is seen that the independently moving car and person stand out, whereas the stationary features on the ground are attenuated in spite of the ego-motion. Figure 8 (d) shows the bounding boxes around the moving car and person after post-processing.

Table 1: Performance evaluation. The right column shows the ground truth number of relevant events in the image sequence. The other columns show the number and percentage of events detected by the system for two minimum track lengths. The last three rows show the number of false alarms due to stationary objects and shadows; ground truth is not relevant for these.

Minimum track length     15 frames    10 frames    Ground truth
Total events             14 (74%)     17 (89%)     19
– Persons                 9 (90%)      9 (90%)     10
– Vehicles                5 (55%)      8 (89%)      9
Total false alarms        3            4           N.A.
– Stationary objects      1            2           N.A.
– Shadows                 2            2           N.A.

Since the algorithm uses a planar motion model, stationary objects above the ground induce motion parallax and are detected if they are sufficiently close to the camera and included in the region of interest. Figure 9 shows the detection of a stationary structure as well as a moving person. Only the parts within the region of interest are detected.

The centroids of the detected bounding boxes were tracked over time, and the tracks that survived over 10 or more frames were identified. Typical snapshots from these tracks were taken, and the distortion due to the ODVS was corrected to get a perspective view looking towards the track position as in [19]. Figure 10 shows the snapshots from these tracks, capturing the detected events.

To evaluate the algorithm performance, the detection results were compared with ground truth obtained by manually observing the video sequence. The performance was compared for two different thresholds on the number of frames for which a track has to survive to be detected as an event. Table 1 shows the detection rate in terms of the total number of events (ground truth) and the number of events actually detected. Note that stationary obstacles and shadows are classified as “false alarms”, since they are currently not separated from independently moving objects. A lower threshold increases the detection rate, but also increases the false alarms. It was observed that the events corresponding to one moving person and cart were not detected at all, for the following reasons. The person and cart were quite far away, and the person in particular had a small size in the image. Furthermore, the camera vehicle was turning, inducing considerable rotational ego-motion. Also, the objects were near the boundary of the region of interest that was analyzed. An image of this person is shown in Figure 11.

Attributes such as the time, duration, and position of the events were extracted. The camera position at the event time was extracted from the onboard GPS. Assuming that the point nearest to the camera lies on the ground, the event location with reference to the camera could be computed. This was added to the camera position coordinates to obtain the event position. The approximate positions of the camera as well as the actual events were mapped as shown in Figure 12. Table 2 summarizes some of the events and their attributes.



Figure 8: Detection of moving objects from an ODVS mounted on an electric cart: (a) Estimated parametric motion of the ground plane. Parts of the image corresponding to the cart as well as distant objects are excluded from the motion estimation process. (b) Features used for estimation. Gray features are inliers, and white features are outliers. (c) Motion compensated difference image. (d) Post-processed image showing detection of the moving car and person. The angle made with the X-axis in degrees is also shown.


Figure 9: Detection of a stationary structure in addition to a moving person. Note that only the part of the structure within the region of interest is detected.



Figure 10: Captured events with their IDs. (a) Detected persons and vehicles. (b) Shadows and stationary obstacles, currently considered false alarms.

Figure 11: Original image corresponding to missed events. The moving person and vehicle were far from the camera and near the boundary of the region of interest. Also, the camera vehicle was taking a turn, inducing considerable rotational ego-motion in the image.

Figure 12: Map showing the positions of the events in red and the camera positions at those times in blue. The ego-vehicle track is marked as a yellow line. The event IDs are labeled in black.

Table 2: Summarization of event attributes. For each event, the left image shows the original ODVS image at the time of the event, the middle image shows the output of the detection algorithm, and the right image shows the snapshot of the event corrected for ODVS distortion.

Event ID                     65             109
Event time (snapshot)        16:01:08.3     16:01:40.0
Event duration [seconds]     3.6            1.9
Event position [meters]      (7.1, -6.6)    (53.1, 1.8)
Camera position [meters]     (7.8, -5.3)    (56.0, -3.1)


Figure 13: Mobile Interactive Avatar: semi-autonomous robotic system used for evaluating the mobile sentry.

6.2 Camera on a Mobile Robot

The second experimental configuration was a robotic platform designed in our laboratory. This platform is called the Mobile Interactive Avatar (MIA) [15], in which cameras and displays can be mounted on a semi-autonomous robot, as shown in Figure 13, to interact with people at a distance. The robot was driven around the corridor of our building with people walking around it. Figure 14 shows the detection of moving persons in one of the frames. Snapshots of detected people are shown in Figure 15. However, it was noted that the speed of the robot was much smaller than that of the people, which means that simpler methods could also yield good results in this scenario. Also, the height of the robot was small, and people's faces could not be effectively captured.

6.3 Helmet mounted camera

The third experimental study for the mobile sentry involved a person walking with an ODVS mounted on a helmet, as shown in Figure 16. In this configuration, the camera height was approximately 2 meters, which enabled easy capture of people's faces. The speed of the person with the helmet was comparable to that of the other people. There was also considerable camera rotation due to head and body movement, which helped to test the algorithm in the presence of large rotational ego-motions. Figure 17 shows the detection of a moving person in one of the frames. To reduce estimation errors, the region of interest for motion analysis was truncated to remove the camera's own image, as well as objects above the horizon. It is seen that there is significant rotational motion between the two frames. In spite of this motion, the moving person is separated from background features such as the lines on the ground. However, if the rotational motion is too large, the detection often deteriorates and tracks get split into parts. Some snapshots of detected people are shown in Figure 18.

6.4 Vehicle mounted ODVS

In a related application [18], the event detection approach was applied to an omni camera mounted on an automobile. The actual vehicle speed, obtained from the CAN bus, was used for the initial motion estimate. An algorithm similar to the one described above was used to detect vehicles on both sides of the car, as shown in Figure 19.


Figure 14: Detection of moving persons in an image sequence from a mobile robot. (a) Estimated parametric motion of the ground plane. The part of the image in the center, which images the camera itself, is not used for estimation. (b) Features used for estimation. Gray features are inliers, and white features are outliers. (c) Normalized image difference after motion compensation. The moving person is detected but the lines on the ground are not. (d) Post-processing and tracking output. The track of the detected person with its ID is shown with a yellow line.



Figure 15: Some of the interesting events captured by the mobile robot.

Figure 16: ODVS camera mounted on a helmet. This configuration enables acquisition of snapshots of surrounding people, including their faces. However, there is considerable camera rotation due to movement of the head and body, which should be compensated.


Figure 17: Detection of a moving person in an image sequence from a helmet mounted camera. (a) Estimated parametric motion of the ground plane. There is significant rotational motion that is estimated by the algorithm. (b) Features used for estimation. Gray features are inliers, and white features are outliers. The parts of the image in the center, above the horizon, and those having small image gradients are not used in estimation. (c) Normalized image difference after motion compensation. The moving person is detected but the lines on the ground are not. (d) Post-processing and tracking output. The track of the detected person with its ID is shown with a yellow line. The position coordinates are computed by assuming that the point on the blob nearest to the camera is on the ground.


Figure 18: Some of the detected events from the helmet mounted camera.

Figure 19: Detection of moving vehicles in an image sequence using an omnidirectional camera mounted on a moving car. The track history of the vehicle over a number of frames is marked. The track ID and the coordinates in the road plane are also shown.

7. SUMMARY AND CONCLUDING REMARKS

This paper described an approach for event detection using ego-motion compensation from mobile omni-directional (ODVS) cameras. It applied the concept of direct motion estimation using image gradients to ODVS cameras. The motion of the ground was modeled as planar motion, and the features not obeying the motion model were separated as outliers. An iterative estimation framework was used for optimally fusing the motion information in the image gradients with the a-priori known information about the camera motion and calibration. Coarse-to-fine motion estimation was used, and the motion between the frames was compensated at each iteration. A scheme based on data snooping was used to remove outliers. Experiments were performed by obtaining image sequences from various types of mobile platforms and detecting events such as moving persons and automobiles.

The motion estimation approach described in the paper can be extended to use multiple motion models for estimation. In the case of scenes with multiple stationary planar surfaces, the surfaces have the same parameters for rotation and translation, but different plane normals [12]. Hence, the ego-motion can be parameterized directly in terms of the linear and angular velocity of the camera and the plane normals of each planar surface. An iterative estimation procedure which estimates each planar surface separately, but uses the estimates it obtains for rotation and translation as starting points for estimating the other planar surfaces, could make the process more robust to outliers. For example, if the scene consists of features far away from the camera, their ego-motion could be considered almost pure rotation, having only three degrees of freedom. These features could be used to estimate the approximate rotation [33]. This rotation could then be used as an initial estimate for the parts of the scene containing the ground plane in order to determine the full planar motion. This procedure could be combined with a robust motion segmentation method such as [27] to automatically separate multiple planar surfaces.

8. ACKNOWLEDGEMENTS

We are thankful for the grant awarded by the Technical Support Working Group (TSWG) of the US Department of Defense, which provided the primary sponsorship of the reported research. We also thank our colleagues from the UCSD Computer Vision and Robotics Research Laboratory for their contributions and support.

9. REFERENCES

[1] O. Achler and M. M. Trivedi. Real-time traffic flow analysis using omnidirectional video network and flatplane transformation. In World Congress on Intelligent Transportation Systems, Chicago, IL, 2002.
[2] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan. Estimation with Applications to Tracking and Navigation. John Wiley and Sons, 2001.
[3] R. Benosman and S. B. Kang. Panoramic Vision: Sensors, Theory, and Applications. Springer, 2001.
[4] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
[5] S. Carlsson and J. O. Eklundh. Object detection using model-based prediction and motion parallax. In European Conference on Computer Vision, pages 297–306, April 1990.
[6] E. Chang and Y.-F. Wang, editors. First ACM International Workshop on Video Surveillance, Berkeley, CA, November 2003.
[7] K. Daniilidis, A. Makadia, and T. Bulow. Image processing in catadioptric planes: Spatiotemporal derivatives and optical flow computation. In IEEE Workshop on Omnidirectional Vision, pages 3–12, June 2002.
[8] G. Danuser and M. Stricker. Parametric model fitting: From inlier characterization to outlier detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2):263–280, March 1998.
[9] K. L. Dean. Smartcams take aim at terrorists. Wired News, June 2003. http://cvrr.ucsd.edu/press/articles/Wired News Smartcams.html.
[10] O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. The MIT Press, Cambridge, MA, 1993.
[11] T. Gandhi, S. Devadiga, R. Kasturi, and O. Camps. Detection of obstacles using ego-motion compensation and tracking of significant features. Image and Vision Computing, 18(10):805–815, 2000.
[12] T. Gandhi and R. Kasturi. Application of planar motion segmentation for scene text extraction. In Int. Conf. on Pattern Recognition, volume 1, pages 445–449, 2000.
[13] T. Gandhi, M.-T. Yang, R. Kasturi, O. Camps, L. Coraor, and J. McCandless. Detection of obstacles in the flight path of an aircraft. IEEE Transactions on Aerospace and Electronic Systems, 39(1):176–191, January 2003.
[14] J. Gluckman and S. Nayar. Ego-motion and omnidirectional cameras. In Proceedings of the International Conference on Computer Vision, pages 999–1005, 1998.
[15] T. B. Hall and M. M. Trivedi. A novel interactivity environment for integrated intelligent transportation and telematic systems. In 5th International IEEE Conference on Intelligent Transportation Systems, Singapore, September 2002.
[16] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809–830, August 2000.
[17] B. Horn and B. Schunck. Determining optical flow. In DARPA81, pages 144–156, 1981.
[18] K. Huang, M. M. Trivedi, and T. Gandhi. Driver's view and vehicle surround estimation using omnidirectional video stream. In IEEE Conference on Intelligent Vehicles, Columbus, OH, June 2003.
[19] K. C. Huang and M. M. Trivedi. Video arrays for real-time tracking of persons, head and face in an intelligent room. Machine Vision and Applications, 14(2):103–111, 2003.
[20] M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):577–589, June 1998.
[21] B. Jahne, H. Haußecker, and P. Geißler. Handbook of Computer Vision and Applications, volume 2, chapter 14, pages 397–422. Academic Press, San Diego, CA, 1999.
[22] K. Kanatani. Geometric Computation for Machine Vision. Oxford University Press, Oxford, 1993.
[23] Q. Ke and T. Kanade. Transforming camera geometry to a virtual downward-looking camera: Robust ego-motion estimation and ground-layer detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 390–397, June 2003.
[24] W. Kruger. Robust real-time ground plane motion compensation from a moving vehicle. Machine Vision and Applications, 11:203–212, 1999.
[25] S. Lawson. Yes, you are being watched. PC World, December 27, 2002. http://www.pcworld.com/news/article/0,aid,108121,00.asp.
[26] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, pages 674–679, 1981.
[27] J. M. Odobez and P. Bouthemy. Direct incremental model-based image motion segmentation for video analysis. Signal Processing, 66:143–145, 1998.
[28] D. Ramsey. Researchers work with public agencies to enhance Super Bowl security, February 15, 2003. http://www.calit2.net/news/2003/2-4 superbowl.html.
[29] O. Shakernia, R. Vidal, and S. Sastry. Omnidirectional egomotion estimation from back-projection flow. In IEEE Workshop on Omnidirectional Vision (in conjunction with CVPR), June 2003.
[30] J. Shi and C. Tomasi. Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition, pages 593–600, 1994.
[31] E. P. Simoncelli. Coarse-to-fine estimation of visual motion. In Proc. Eighth Workshop on Image and Multidimensional Signal Processing, pages 128–129, Cannes, France, 1993.
[32] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In Proceedings of IEEE Int'l Conference on Computer Vision and Pattern Recognition, pages 246–252, 1999.
[33] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, March 1998.
[34] R. F. Vassallo, J. Santos-Victor, and H. J. Schneebeli. A general approach for egomotion estimation with omnidirectional images. In IEEE Workshop on Omnidirectional Vision, pages 97–103, June 2002.