
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 4, NO. 2, JUNE 2003

Computer Vision Algorithms for Intersection Monitoring

Harini Veeraraghavan, Osama Masoud, and Nikolaos P. Papanikolopoulos, Senior Member, IEEE

Abstract—The goal of this project is to monitor activities at traffic intersections for detecting/predicting situations that may lead to accidents. Some of the key elements for robust intersection monitoring are camera calibration, motion tracking, incident detection, etc. In this paper, we consider the motion-tracking problem. A multilevel tracking approach using a Kalman filter is presented for tracking vehicles and pedestrians at intersections. The approach combines low-level image-based blob tracking with high-level Kalman filtering for position and shape estimation. An intermediate occlusion-reasoning module serves the purpose of detecting occlusions and filtering relevant measurements. Motion segmentation is performed by using a mixture of Gaussian models, which helps us achieve fairly reliable tracking in a variety of complex outdoor scenes. A visualization module is also presented. This module is very useful for visualizing the results of the tracker and serves as a platform for the incident detection module.

Index Terms—Camera calibration, incident detection, motion segmentation, occlusion reasoning, vehicle tracking.

I. INTRODUCTION

INCIDENT monitoring in outdoor scenes requires reliable tracking of the entities in the scene. In this project, we are interested in monitoring incidents at an intersection. The tracker should not only be able to handle the inherent complexities of an outdoor environment, but also the complex interactions of the entities among themselves and with the environment.

This paper combines low-level tracking (using image elements) with higher level tracking to address the problem of tracking in outdoor scenes. Reliable tracking requires that the tracked target can be segmented out clearly. This can be done either by using models that describe the appearance of the target or by using a model describing the appearance of the background. For our case of outdoor vehicle tracking, where the tracked vehicles are unknown and quite variable in appearance (owing to the complexity of the environment), it is easier to build models for the background (which is relatively constant). The model should be able to capture the variations in appearance of the scene due to changing lighting conditions. It should also be able to prevent foreground objects from being modeled as background (e.g., the slow stop-and-go motion of vehicles in crowded intersections).

Manuscript received December 16, 2002; revised September 15, 2003. This work was supported by the ITS Institute at the University of Minnesota, the Minnesota Department of Transportation, and the National Science Foundation under Grants CMS-0127893 and IIS-0219863. The Guest Editors for this paper were R. L. Cheu, D. Srinivasan, and D.-H. Lee.

The authors are with the Artificial Intelligence, Vision and Robotics Laboratory, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TITS.2003.821212

A poor model of the background results in effects like "ghosting," as shown in Fig. 1.

Tracking based on blobs (segmented foreground), though extremely computationally efficient, results in a significant loss of information regarding the tracked entities due to its simplified representation. This leads to tracking difficulties due to the target data association problem. We show that tracking can be improved significantly through more reliable data association, by integrating cues from the image with the estimated shape and motion of the tracked target itself. We use oriented bounding boxes as opposed to axis-aligned boxes; these capture the orientation of the blobs, giving a much tighter fit than conventional axis-aligned boxes, as illustrated in Fig. 2. Higher level models of the target that capture its motion and shape across frames are constructed. A Kalman filter is used for this purpose. Although several methods exist for modeling based on data, Kalman filters provide one of the best ways of doing real-time online prediction and estimation.

The low-level module, which consists of blob tracking, interacts with the image-processing module. The results from this level (tracked blobs) are passed on to the high level, where blobs are interpreted as moving objects (MOs). Shape estimation consists of estimating the dimensions of the bounding box and the position of one corner point with respect to the blob centroid. The results from the shape estimator are used for occlusion reasoning. A visualization tool has been developed for visualizing the results of the tracking and the incident detection module.

The paper is arranged as follows: Section II discusses the problem and the motivation for this work. Section III discusses related work in this area. The general tracking approach is discussed in Section IV. The segmentation method is discussed briefly in Section V. Section VI discusses blob tracking, moving object tracking, and Kalman filtering. Occlusion reasoning is presented in Section VII. The incident detection module and camera calibration are discussed in Sections VIII and IX, respectively. Section X presents our results, followed by discussion and conclusions in Sections XI and XII.

II. INTERSECTION COLLISION PREDICTION PROBLEM

Intersection monitoring is an important problem in the context of intelligent transportation systems (ITS). A real-time scene monitoring system capable of identifying situations giving rise to accidents would be very useful. Real-time incident detection requires robust tracking of entities, reliably projecting the current state of the scene to future time, and identifying the colliding entities.


Fig. 1. (a) Approximated image of background and (b) current image. The background shows a long trail of the bus, as the bus was modeled into the background when it stopped.

Fig. 2. Oriented bounding boxes provide a much closer fit to the vehicles than axis-aligned boxes.

The scope of this paper is concerned with a real-time, vision-based system for tracking moving entities. Reliable prediction requires very robust tracking. Achieving robust tracking in outdoor scenes is a hard problem owing to the uncontrollable nature of the environment. Furthermore, tracking in the context of an intersection should be able to handle non-free-flowing traffic and arbitrary camera views. The tracker should also be capable of reliably handling the large number of occlusions and interactions of the entities with each other in the scene.

III. RELATED WORK

A. Segmentation

Commonly used methods for motion segmentation, such as static background subtraction, work fairly well in constrained environments. These methods, though computationally efficient, are not suitable for unconstrained, continuously changing environments. Median filtering on each pixel with thresholding based on hysteresis was used by [18] for building a background model. A single Gaussian model for the intensity of each pixel was used by [22] for image segmentation in relatively static indoor scenes. Alternatively, Friedman et al. [8] used a mixture of three Gaussians for each pixel to represent the foreground, background, and shadows, using an incremental expectation maximization method. Stauffer et al. [19] used a mixture of Gaussians for each pixel to adaptively learn the model of the background. Nonparametric kernel density estimation has been used by [7] for scene segmentation in complex outdoor scenes. Cucchiara et al. [5] combined statistical and knowledge-based methods for segmentation: a median filter is used for updating the background model selectively, based on knowledge about the moving vehicles in the scene. Ridder et al. [16] used an adaptive background model updated using the Kalman filter. In [10], a mixture of Gaussians model with online expectation maximization algorithms for improving the background update is used.

B. Tracking

A large number of methods exist for tracking objects in outdoor scenes. Coifman et al. [4] employed a feature-based tracking method for tracking free-flowing traffic using corner points of vehicles as features. The feature points are grouped based on a common motion constraint. Heisele et al. [9] tracked moving objects in color image sequences by tracking the color clusters of the objects. Other tracking methods involve active contour based tracking, 3-D model based tracking, and region tracking.

A multilevel tracking scheme has been used in [6] for monitoring traffic. The low level consists of image processing, while the high-level tracking is implemented as a knowledge-based, forward-chaining production system. McKenna et al. [14] performed three-level tracking consisting of regions, people, and groups (of people) in indoor and outdoor environments. Kalman filter based feature tracking for predicting trajectories of humans was implemented by [17]. Koller et al. [11] used a tracker based on two linear Kalman filters, one for estimating the position and the other for estimating the shape of the vehicles moving in highway scenes. Similar to this approach, Meyer et al. [15] used a motion filter for estimating the affine parameters of an object for position estimation. A geometric Kalman filter was used for shape estimation, wherein the shape of the object was estimated by estimating the position of the points in the convex hull of the vehicle.


Fig. 3. Tracking approach.

In our application, we are interested in the object's position in scene coordinates. Position estimation in this case can be done reliably using a simple translational model moving with constant velocity. Furthermore, a region can be represented very closely by using an oriented bounding box, without requiring its convex hull. Our approach differs from that of Meyer et al. [15] in that we use a simple translational model for estimating the position of the centroid, and the bounding box dimensions for shape. Although vehicle tracking has generally been addressed for free-flowing traffic in highway scenes, this is one of the first papers that addresses the tracking problem for non-free-flowing, cluttered scenes such as intersections.

IV. APPROACH

An overview of our approach is depicted in Fig. 3. The input to the system consists of grayscale images obtained from a stationary camera. Image segmentation is performed using a mixture of Gaussian models method as in [19]. The individual regions are then computed by using a connected-components extraction method. The various attributes of the blob, such as centroid, area, elongation, and first and second-order moments, are computed during the connected-component extraction. In order to obtain a close fit to the actual blob dimensions, appropriately rotated bounding boxes (which we call oriented bounding boxes) are used. These are computed from principal component analysis of the blobs.

Blob tracking is then performed by finding associations between the blobs in the current frame and those in the previous frame, based on the proximity of the blobs. This is valid only when the entities do not move very far between two frames; given the frame rate and the scenes, this is a valid assumption. The blobs in the current frame inherit the timestamp, label, and other attributes, such as velocity, from a related blob. The tracked blobs are later interpreted as MOs in the higher level. Position estimation of the MOs is done using an extended Kalman filter, while their shape estimation is done using a standard discrete Kalman filter. The results from the shape estimator are used for occlusion detection.

The occlusion detection module detects occlusions on the basis of the relative increase or decrease in the size of a given blob with respect to the estimated size of its MO. Two different thresholds are used for determining the extent of occlusion. The module also serves as a filter for the position measurements passed to the extended Kalman filter. The results from the tracking module are then passed on to the visualization module, where the tracker results can be viewed graphically.


V. MOVING OBJECT EXTRACTION

Tracking in outdoor, crowded scenes requires that the tracked entities can be segmented out reliably in spite of the complexities of the scene due to changing illumination, static and moving shadows, uninteresting background (swaying tree branches, flags), and camera motion. The method should also be fast enough so that no frames are skipped. Another requirement in this application is that stopped entities, such as vehicles or pedestrians waiting for a traffic light, should continue to be tracked.

A. Background Segmentation

An adaptive Gaussian mixture model method based on [19] is used. Each pixel in the image is associated with a mixture of Gaussian distributions (five or six) based on its intensities. Each distribution is characterized by a mean, a variance, and a weight representative of the frequency of occurrence of the distribution. The method for segmentation is described in [19]. The Gaussian distributions are sorted in order from the most common to the least common, and pixels whose matching distribution has a weight above a certain threshold are classified as background, while the rest are classified as foreground.

Moving entities are then extracted using a two-pass connected-components extraction method. In order to prevent noise from being classified as foreground, a threshold is used so that any blob with area lower than the threshold is deleted from the foreground.
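To make the classification rule above concrete, the following minimal sketch (ours, not the authors' code) tests one pixel against its mixture, assuming the common 2.5-sigma match test from [19]; the number of Gaussians and the weight threshold are illustrative values.

    import numpy as np

    K = 5            # Gaussian distributions per pixel (the paper uses 5 or 6)
    W_THRESH = 0.6   # illustrative weight threshold separating background modes

    def classify_pixel(intensity, means, variances, weights):
        """Return True if the pixel is background under the rule above.

        means, variances, weights: length-K arrays for this pixel, sorted
        from the most common distribution to the least common one.
        """
        for k in range(K):
            # Match test: within 2.5 standard deviations of the mean.
            if abs(intensity - means[k]) < 2.5 * np.sqrt(variances[k]):
                # A match against a sufficiently common (high-weight)
                # distribution is background; a match against a rare
                # distribution is foreground.
                return weights[k] > W_THRESH
        return False  # no matching distribution: foreground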

B. Oriented Bounding Boxes

Horizontal and vertical axis-aligned boxes cannot provide tight fits to all vehicles moving in arbitrary directions. As a result, oriented bounding boxes are used to represent blobs. The oriented bounding box is computed using the two principal axes of the blob, which are in turn computed using principal component analysis. The covariance matrix used to compute this consists of the blob's first and second-order moments

$$M = \begin{bmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{bmatrix} \qquad (1)$$

where $\mu_{pq}$ is the $(p,q)$th-order central moment of the blob. Diagonalizing $M$ gives

$$M = \Phi \Lambda \Phi^{T} \qquad (2)$$

where $\Phi = [\phi_1 \ \phi_2]$ represents the eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2)$ represents the eigenvalues. If $\lambda_1 \geq \lambda_2$, we choose $\phi_1$ as the principal axis with elongation $\sqrt{\lambda_1}$. The angle made by the principal axis with respect to the $x$ axis of the image is also computed from the eigenvectors. Similarly, $\phi_2$ is chosen as the second principal axis with elongation $\sqrt{\lambda_2}$. The method of PCA is illustrated in Fig. 4.
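A short sketch of how (1) and (2) turn into an oriented box follows; the projection onto the principal axes to obtain the box extents is our reading of the construction, and the function name is ours.

    import numpy as np

    def oriented_box(pixels):
        """pixels: (N, 2) float array of the blob's (x, y) coordinates."""
        centroid = pixels.mean(axis=0)
        d = pixels - centroid
        # Second-order central moments gathered during connected-component
        # extraction; they form the covariance matrix M of (1).
        mu20 = np.mean(d[:, 0] ** 2)
        mu02 = np.mean(d[:, 1] ** 2)
        mu11 = np.mean(d[:, 0] * d[:, 1])
        M = np.array([[mu20, mu11], [mu11, mu02]])
        lam, phi = np.linalg.eigh(M)   # diagonalization (2), ascending eigenvalues
        # Angle of the principal axis (largest eigenvalue) w.r.t. the image x axis.
        angle = np.arctan2(phi[1, 1], phi[0, 1])
        # Projecting the pixels onto the two principal axes gives the tight
        # oriented bounding box in the PCA frame.
        proj = d @ phi
        return centroid, angle, proj.min(axis=0), proj.max(axis=0)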

VI. TRACKING

Tracking is performed at two levels. The lower level consists of blob tracking, which interacts with the image-processing module. The tracked blobs are then abstracted as MOs, which are tracked in the higher level.

Fig. 4. Principal component analysis.

Fig. 5. Blob splits and merges.

Fig. 6. Computing overlap between two bounding rectangles. The intersecting points are first computed and then ordered to form a convex polygon. The shaded area represents the overlap area.

A. Blob Tracking

In every frame, a relation is sought between the blobs in the current frame and those in the previous frame. The relations are represented in the form of an undirected bipartite graph, which is then optimized based on the method described in [12]. The following constraints are used in the optimization:

1) A blob may not participate in both a split and a merge at the same time;

2) Two blobs can be connected only if they have a boundingbox overlap area at least half the size of the smaller blob.

Blob splits and merges are illustrated in Fig. 5. The graph computation method is explained in detail in [20].

Fig. 7. Confidence ellipses of the tracked targets. The position is in world coordinates. The increase in uncertainty is shown by the increase in the size of the ellipse in the regions where occlusion occurs and hence no measurement is available. The ellipse is centered around the position estimate at the current frame.

To compute the overlap between the bounding boxes of the blobs, a simple two-step method is used. In the first step, the overlap area between the axis-aligned bounding boxes formed by the corner points of the blobs is computed. This helps to eliminate totally unrelated blobs. In the next step, the intersecting points between the two bounding rectangles are computed. The points are then ordered to form a closed convex polygon whose area gives the overlap area. This is illustrated in Fig. 6.
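The paper computes the intersection points of the two rectangles and orders them into a convex polygon; an equivalent, easy-to-verify formulation is to clip one rectangle against the other (Sutherland-Hodgman), sketched here under the assumption that both corner lists are given in counterclockwise order.

    import numpy as np

    def _cross(u, v):
        """z-component of the 2-D cross product."""
        return u[0] * v[1] - u[1] * v[0]

    def _clip(poly, a, b):
        """One clipping step: keep the part of convex `poly` left of edge a->b."""
        out = []
        for i in range(len(poly)):
            p, q = poly[i], poly[(i + 1) % len(poly)]
            side_p = _cross(b - a, p - a) >= 0
            side_q = _cross(b - a, q - a) >= 0
            if side_p:
                out.append(p)
            if side_p != side_q:           # the edge p->q crosses the clip line
                t = _cross(b - a, a - p) / _cross(b - a, q - p)
                out.append(p + t * (q - p))
        return out

    def overlap_area(rect1, rect2):
        """rect1, rect2: (4, 2) corner arrays in counterclockwise order."""
        poly = [np.asarray(p, float) for p in rect1]
        r2 = np.asarray(rect2, float)
        for i in range(4):
            poly = _clip(poly, r2[i], r2[(i + 1) % 4])
            if not poly:
                return 0.0                 # the rectangles do not intersect
        pts = np.array(poly)               # the ordered convex overlap polygon
        x, y = pts[:, 0], pts[:, 1]
        # Shoelace formula gives the polygon's area.
        return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))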


The results from this module are passed on to the high-level module, where tracking consists of refining the position and shape measurements by means of Kalman filtering. An extended Kalman filter is used for estimating the position of an MO in scene coordinates, while the shape of the MO is estimated in image coordinates using a discrete Kalman filter.

B. Kalman Filter Tracking

An explanation of Kalman filter theory can be found in [1] and [21]. The position estimation filter is responsible for estimating the target position in scene coordinates. The entities are assumed to move with constant velocities, and any changes in the velocity are modeled as noise in the system. Because of the nonlinearity in the mapping from the state space (world coordinates) to the measurement space (image coordinates), an extended Kalman filter is used. The state vector is represented as $X = [x, y, \dot{x}, \dot{y}]^T$, where $x$, $y$ are the positions of the centroid in the x-y scene coordinates and $\dot{x}$, $\dot{y}$ are the velocities in the x, y directions. The state transition matrix is given by

$$A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where $\Delta t$ is the time elapsed between two frames. The error covariance of the system noise is given by

$$Q = q \begin{bmatrix} \Delta t^4/4 & 0 & \Delta t^3/2 & 0 \\ 0 & \Delta t^4/4 & 0 & \Delta t^3/2 \\ \Delta t^3/2 & 0 & \Delta t^2 & 0 \\ 0 & \Delta t^3/2 & 0 & \Delta t^2 \end{bmatrix}$$

where $q$ is the variance in the acceleration. The measurement error covariance is given by $R = \sigma_m^2 I_2$. The measurement error standard deviation $\sigma_m$ is obtained based on the variance in the percentage difference between the measured and previously estimated size (area). The Jacobian $H$ of the measurement matrix is used due to the nonlinearity in the mapping from image to world coordinates of the target's positions.

The filter is initialized with the scene-coordinate position of the object, obtained by back-projecting the image measurements using the homography matrix. The homography matrix is computed from the camera calibration. The filter estimates a model of the motion of the target based on the measurements; estimating the model corresponds to estimating the position and the velocity of the target.
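A compact sketch of the resulting predict/update cycle under the constant-velocity model follows; h and jacobian_h stand in for the calibrated scene-to-image mapping and its Jacobian $H$, and the Q written out here is the standard white-noise-acceleration form matching the text, which may differ in detail from the authors' exact choice.

    import numpy as np

    def ekf_predict(x, P, dt, q):
        """x: state [x, y, vx, vy]; P: 4x4 covariance; q: acceleration variance."""
        A = np.array([[1, 0, dt, 0],
                      [0, 1, 0, dt],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1]], float)
        Q = q * np.array([[dt**4 / 4, 0, dt**3 / 2, 0],
                          [0, dt**4 / 4, 0, dt**3 / 2],
                          [dt**3 / 2, 0, dt**2, 0],
                          [0, dt**3 / 2, 0, dt**2]], float)
        return A @ x, A @ P @ A.T + Q

    def ekf_update(x, P, z, h, jacobian_h, sigma_m):
        """z: measured image centroid; h maps scene state to image coordinates."""
        R = sigma_m ** 2 * np.eye(2)       # measurement error covariance
        H = jacobian_h(x)                  # 2x4 Jacobian of the measurement map
        S = H @ P @ H.T + R                # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
        x_new = x + K @ (z - h(x))
        P_new = (np.eye(4) - K @ H) @ P
        return x_new, P_new

When no measurement is available (occlusion), only ekf_predict runs, which is exactly the prediction-only mode with growing covariance described in the next subsection.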

C. Measurement Vector

The measurement for an MO consists of the centroid of the blob (computed from the connected-components extraction) and the oriented bounding box coordinates (computed using principal component analysis). These measurements are obtained from the blob tracking module.

In order to ensure that the Kalman filters provide as accurate an estimate of the target as possible, it is necessary to provide the filters only with relevant measurements (measurements that can be distinguished as uniquely arising from the target). For example, when there is an occlusion, it is better to treat this case as an absence of measurement than to use the measurement for estimation, as it is ambiguous which object the measurement belongs to. The occlusion detection module acts as a filter, serving to disregard erroneous measurements provided to the position and shape estimation Kalman filters.

Fig. 8. Incident detection interface.

Erroneous measurements are those for which a target does not have a unique measurement (no measurement is associated only with this target) or for which the measured blob's area differs significantly from the target's estimated bounding box area. Data association in the case of a single object related to multiple blobs (multiple measurements) is done by using either the most related blob (nearest neighbor) or the average centroid of all the related blobs (when all the related blobs are very close to each other). In the case of multiple objects related to one or more of the same blobs (e.g., when two vehicles are close to each other and share one or more blob measurements), the measurements are best ignored and hence treated as missing, in the hope that the ambiguity will clear up after a few frames. In this case, the Kalman filter takes over in a prediction-only mode. The filter predicts based on its estimates of the velocity and position obtained from the previous frame, with increasing uncertainty as long as no measurement is available, as depicted in Fig. 7. As soon as a measurement is obtained, the size of the ellipse decreases. As shown in Fig. 7, the ellipses are centered around the estimate at the current frame, and the area of the ellipse corresponds to the covariance of the estimate: the larger the area, the larger the covariance of the estimate.
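A sketch of this measurement-selection logic is given below; the closeness radius and the stand-alone function shape are our assumptions, since the paper leaves these details to the association graph.

    import numpy as np

    def measurement_for(mo_pos, blob_centroids, shared_with_other_mo,
                        closeness_radius=15.0):
        """Return an image measurement for one MO, or None to skip the update.

        mo_pos: predicted MO centroid, shape (2,); blob_centroids: (N, 2)
        centroids of the blobs related to this MO. The radius (pixels) is
        illustrative; the paper gives no numeric value.
        """
        if shared_with_other_mo:
            return None                # ambiguous: run in prediction-only mode
        c = np.atleast_2d(np.asarray(blob_centroids, float))
        if len(c) == 1:
            return c[0]
        # All related blobs mutually close: use their average centroid.
        if np.max(np.linalg.norm(c - c.mean(axis=0), axis=1)) < closeness_radius:
            return c.mean(axis=0)
        # Otherwise fall back to the most related (nearest-neighbor) blob.
        return c[np.argmin(np.linalg.norm(c - mo_pos, axis=1))]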

Generally, if the occlusions occur a few frames (at least five or six) after target instantiation (so that the motion parameters have been learned with fairly high accuracy), the filter's prediction remains fairly reliable for several frames. However, one of the obvious limitations of discarding measurements is that the filter's prediction uncertainty increases and might become very large, and hence unreliable, when a large number of measurements has to be dropped. Such cases can arise very often in very crowded scenes. Although dropping measurements is better than using incorrect measurements, it would be better if we could use at least some of the measurements, by weighting them probabilistically or by using cues other than just overlaps to identify the target's measurements (e.g., a template of the target).


Fig. 9. Position estimation.

Fig. 10. Tracking sequence.

Another related problem with using blob overlaps and target-blob proximity for taking associated measurements is that incorrect measurement associations might be formed (especially when the target's position is highly uncertain), resulting in track divergence and track jumping.

D. Shape Estimation

Currently, the main motivation for doing shape estimation is detecting occlusions. As a result, it suffices to do the estimation in image coordinates. Later on, however, we would like to do this estimation in scene coordinates to provide better estimates to the incident detection module, where collision detection is performed.

Three independent filters are used for shape estimation. The bounding box shape can be estimated from its length and height. However, we also need an estimate of where to place the box in the image. This can be known if the distance of one bounding box corner with respect to the centroid of the blob is known. Hence, the parameters estimated are the distance ($x$ and $y$ coordinate distances) of a corner point from the blob centroid, and the length and the height (measured as $x$ and $y$ coordinate distances of the two other orthogonal corner points from this point). The state vector in each of the filters is represented as $s = [s_x, s_y]^T$, where $s_x$ and $s_y$ are the distances in image coordinates. The state transition matrix is the identity, $A = I_2$, and the measurement error covariance for all the filters is based on the variance in the percentage difference between the estimated and the measured area of the MO.
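Because $A$ is the identity and the measurement is the state itself, each shape filter reduces to a few lines. This sketch is ours: it treats occluded frames as missing measurements, and the noise magnitudes are placeholders.

    import numpy as np

    class ShapeFilter:
        """One of the three discrete shape filters: A = I, H = I."""

        def __init__(self, s0, p0=10.0, q=1.0):
            self.s = np.asarray(s0, float)  # (x, y) distance in image coordinates
            self.P = p0 * np.eye(2)
            self.Q = q * np.eye(2)          # placeholder process noise

        def step(self, z, r):
            self.P = self.P + self.Q        # predict: state carries over unchanged
            if z is not None:               # occluded frames pass z = None
                K = self.P @ np.linalg.inv(self.P + r * np.eye(2))
                self.s = self.s + K @ (np.asarray(z, float) - self.s)
                self.P = (np.eye(2) - K) @ self.P
            return self.s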

VII. OCCLUSION REASONING

Occlusions can be classified into one of two types. The first type is inter-object occlusion, which occurs when one MO moves behind another. This kind of occlusion is the easiest to deal with, as long as the two targets have been tracked distinctly. In this case, two MOs share one or more blobs.


Fig. 11. Shape estimation for an occluded sequence. Occlusions are indicated by missing measurements.

As only blobs are used for establishing associations, it can be difficult to associate a blob uniquely with one target. As a result, the best thing to do in this case is simply to ignore the measurements and let the individual filters of the MOs participating in the occlusion operate in prediction mode. Tracking under this kind of occlusion is illustrated in Fig. 10 between vehicles numbered 37 and 39. This case cannot be dealt with when the two targets enter the view of the camera already occluded. Another case that cannot be handled is when the MOs deliberately participate in merging (e.g., a pedestrian getting into a car). In this case, the pedestrian MO filter treats the merging of the two targets as an occlusion and continues to estimate the pedestrian's position based on its previously estimated model.

The second type is object-background occlusion. This occurs when the tracked MO moves behind, or emerges from behind, an existing background structure. It can be further classified into two types based on the effect of the occlusion on the blob size:

1) Object moving behind thin background structures: The scene structure in this case might be thin poles or trees, and the occlusion results in blob splits as the target moves behind the structure. As long as there is only one target moving behind the structure, this can be dealt with, since all the blobs are really related to the target; the measurement can be taken as a weighted average of the blob centroids (weighted by the percentage overlap of each blob with the target). This can get complicated when this occlusion is compounded with inter-object occlusion; in that case, it is simply treated as inter-object occlusion, as described in the previous paragraph.

2) Object moving behind thick background structures: This is caused by structures such as buildings and overpasses, which cause the foreground blobs that represent the MO to disappear from the scene for a certain length of time. As long as a good estimate of the MO is present, its position can be estimated and it can be tracked as soon as it emerges from the occlusion. One main problem here is due to the use of the blob centroid as the measurement for the position of the MO. A common failure occurs with slow-moving objects undergoing this kind of occlusion: as the MO starts moving behind the structure, its blob size is gradually reduced. If this is not detected, it can look like a decrease in the velocity of the target (as the centroid of the blob shifts toward the unoccluded portion). The effect is that the predicted MO (now being moved at a slower speed) will fail to catch up with the blob when it eventually re-emerges. In this case, it is important and useful to detect the onset of occlusion. This can be done using shape estimation, which is discussed in the following subsection.

A. Shape Estimation Based Occlusion Reasoning

Occlusion reasoning is performed based on the discrepancy between the measured size and the estimated size. The results from the shape estimation module are used for this purpose. Accurate occlusion reasoning strongly depends on the accuracy of the shape estimation. As long as there is no occlusion, the expected variation in the area will be the same as the measured variation; in other words, the expected area will be more or less the same as the measured area.


Fig. 12. Shape estimation for a turn sequence.

However, when there is an occlusion, there will be a significant change in the expected area compared to the measured area. The same holds for the case when the object comes out of an occlusion. For example, when a tracked object moves behind a background region, the measured area will be much less than the expected area. Similarly, when a tracked object comes out of an occlusion, its measured area will be larger than the expected area:

$$\text{occlusion} = \begin{cases} \text{large occlusion} & \text{if } |A_E - A_M|/A_E > T_H \\ \text{partial occlusion} & \text{if } T_L < |A_E - A_M|/A_E \leq T_H \\ \text{no occlusion} & \text{otherwise} \end{cases}$$

where $A_E$ is the expected area of the blob, $A_M$ is the actual measured area, $T_L$ is the low threshold, and $T_H$ the high threshold. The thresholds are used for determining the nature of the occlusion (partial or total). When the percentage size change between the measured and the expected area is above a certain high threshold, it is hypothesized to be a large occlusion, and a partial occlusion if it is above the low threshold but below the high threshold. These thresholds were determined by trial and error, based on testing with different values on different scenes. We use a low threshold of about 0.3 to 0.4 and a high threshold of about 0.8 to 0.9. These thresholds correspond to the percentage change in the area (measured in image coordinates). However, using the same threshold everywhere in the image has limitations depending on the camera view. For instance, depending on zooming effects, these thresholds may work only when the vehicles are close to the center of the image.

The reason behind using two different thresholds is to detect the nature of the occlusion. In the case of partial occlusions, the position measurement is used for the filter update but the shape measurements are ignored. In the case of total occlusions, both the position and shape measurements are ignored. The reason for detecting the nature of the occlusion is twofold. First, by taking position measurements in the event of partial occlusion, we can provide more measurements (though less reliable) to the filter. Second, as the shape estimates are not updated, the onset of total occlusion can be identified earlier, and hence the problem discussed in Section VII for objects moving behind thick background structures can be addressed.
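The decision rule then reduces to a small function; the default thresholds below sit inside the 0.3-0.4 and 0.8-0.9 ranges quoted above and would be tuned per camera view.

    def occlusion_state(expected_area, measured_area, t_low=0.35, t_high=0.85):
        """Classify the occlusion from the relative area discrepancy."""
        change = abs(expected_area - measured_area) / expected_area
        if change > t_high:
            return "large"      # ignore both position and shape measurements
        if change > t_low:
            return "partial"    # keep the position, ignore the shape measurements
        return "none"           # use all measurements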

VIII. INCIDENT DETECTION VISUALIZATION MODULE

The results from the vision module are passed to the incident detection module. The incident detection module is responsible for detecting situations such as possible collisions between vehicles. For this, it uses the position, velocity, and shape (length and width) of the vehicles in scene coordinates, obtained from the vision module. Currently, our focus is only on collision detection at the current frame. Collisions can be detected by checking whether the distance between any two vehicle bounding boxes is less than a threshold, as sketched below, and the results can be presented visually in the module, as shown in Fig. 8. The module acts as a graphical user interface providing real-time visualization of the data with an easy-to-use, VCR-like interface. The module can also be used for presenting the results of the tracking, which makes it a very useful tool for debugging.
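A minimal sketch of that distance test, approximating each vehicle's box by a center and a bounding radius in scene coordinates; the threshold value is illustrative, not the paper's.

    import numpy as np

    def near_collisions(centers, radii, threshold=2.0):
        """centers: (N, 2) scene coordinates in meters; radii: (N,) box radii."""
        pairs = []
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                gap = (np.linalg.norm(centers[i] - centers[j])
                       - radii[i] - radii[j])
                if gap < threshold:
                    pairs.append((i, j))   # drawn as connecting line segments
        return pairs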


Fig. 13. Tracking sequence showing occlusion handling.

Fig. 14. Tracking sequence in winter.

Fig. 15. Tracking results in snow and shadow.

Fig. 8 shows a snapshot of the interface. The vehicles are shown by rectangular boxes, and vehicles in very close proximity are indicated by line segments. Camera calibration is used for recovering the scene coordinates of the traffic objects.

IX. CAMERA CALIBRATION

Camera parameters are hard to obtain after the camera has already been installed in the scene. Hence, the parameters are estimated using features in the scene. This is done by identifying certain landmarks in the scene that are visible in the image, along with their distances in the real world. A camera calibration tool described in [13] is used for calibration. The input to the tool consists of landmarks and their distances in the scene. The tool computes the camera parameters using a nonlinear least squares method. Once the scene is calibrated, any point in the image can be transformed to scene coordinates (the corresponding point on the ground plane of the scene).
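The back-projection step used by the position filter amounts to applying the inverse homography and renormalizing. A sketch, with H taken as the scene (ground plane)-to-image homography produced by the calibration tool; the function name is ours.

    import numpy as np

    def image_to_scene(H, u, v):
        """H: 3x3 scene-to-image homography; (u, v): image point in pixels."""
        p = np.linalg.inv(H) @ np.array([u, v, 1.0])
        return p[:2] / p[2]    # normalize the homogeneous coordinate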

X. RESULTS

Our tracking system has been tested in a variety of weather conditions: sunny, cloudy, snow, etc. The results of a tracked sequence are shown in Fig. 10. The tracking sequence shown consists of a total of 44 frames, with results shown for frames 986, 1014, and 1030. The lines behind the vehicles and pedestrians show their trajectories. The numbers on the pedestrians and vehicles are the track labels assigned to every tracked MO. As can be seen from the sequence, the tracker handles the occlusions between the cars very well. Fig. 13 shows occlusion handling between two vehicles. Tracking in a winter sequence is shown in Fig. 14, while Fig. 15 shows tracking in snow and shadow conditions.

The results of the Kalman filter position estimates for a vehicle are shown in Fig. 9. The position estimates of the Kalman filter are presented against the actual measurements, in image coordinates. The results are presented for a vehicle that was occluded multiple times, as shown


in Fig. 9 (indicated by the absence of measurements). The sequence also illustrates occlusion handling between two vehicles. The results of shape estimation for a vehicle undergoing occlusions and for a turning vehicle are shown in Figs. 11 and 12, respectively. The results are presented as the estimated length and height against the actual measurements. The turn sequence shows an increase in the length and height of the vehicle as its pose with respect to the camera changes. The length and height represent the coordinate differences between the estimated bounding box corner point and its adjacent corner points on the bounding rectangle; this is why some of the length and height measurements in Figs. 11 and 12 have negative values.

XI. DISCUSSION

We now provide a brief discussion and insights into future work. The two-level Kalman filter based tracking is capable of providing robust tracking under most scenarios. Combining a shape estimation filter with a position estimation filter helps not only to identify occlusions, but also to propagate only the reliable measurements to the high-level tracking module. Good data association is essential for the robust performance of the Kalman filter. The system can be applied reliably in most traffic scenes, ranging from moderately crowded to even heavily crowded (as long as some reliable measurements can be provided to the filter through the sequence).

The Gaussian mixture model approach works fairly well for most traffic scenes and can handle illumination changes fairly quickly. However, this method cannot be used for tracking stopped vehicles, because stopped vehicles are modeled into the background. For our purpose, we cannot treat vehicles or pedestrians waiting for a traffic signal as background, as they stop only for short periods of time. Although we can detect static cast shadows and model them into the background, we cannot detect moving cast shadows in the image. Moving cast shadows distort the shape of the vehicle and affect the quality of the tracker. These problems are addressed to some extent by the two-level tracking approach. For example, the Kalman filter can continue to track a vehicle for some time even after it gets modeled into the background, based on its previous estimates. Similarly, if the region where the moving shadows occur is small, the shape estimator can ignore it as a bad measurement.

Although the tracker performs very well in moderately crowded scenes with little background clutter, the performance deteriorates in very cluttered scenes, because an increasing number of measurements is ignored as crowd density increases, resulting in track divergence. One problem with the current shape estimation method is that it is sometimes difficult to distinguish between an occlusion and a pose change based on a relative size increase or decrease. Treating both cases with the same hypothesis (size change) is not sufficient and results in tracker inaccuracies. This in itself suggests several improvements to the tracker. Instead of using a single hypothesis from the Kalman filter, we should be able to formulate multiple hypotheses for tracking. The need for

multiple hypothesis-based tracking arises from the increased ambiguity in the data in the presence of clutter and the ambiguity in distinguishing different motions (a turn versus a vehicle passing under a background artifact). A probabilistic data association filter has been used by [2] for tracking targets in clutter. Similarly, a multiple hypothesis approach that maintains a bank of Kalman filters has been used by Cham et al. [3] for tracking human figures. Another direction for improvement would involve using more cues from the image itself. Although stopped vehicles can be tracked for some time by the Kalman filter, they cannot be tracked reliably over long periods without actual measurements. This requires changes or improvements to the existing segmentation method, for example through additional measurements from a template of the region (constructed from previous tracking instances).

XII. CONCLUSION

A multilevel tracking approach for tracking the entities in intersection scenes is presented. The two-level tracking approach combines low-level image processing with high-level Kalman filter based tracking. Combinations of position and shape estimation filters that interact with each other indirectly are used for tracking. The shape estimation filter serves the purpose of occlusion detection and helps provide reliable measurements to the position estimation filter. An incident detection visualization module has been developed, which provides an easy-to-use graphical interface and online visualization of the results.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their helpful and constructive comments.

REFERENCES

[1] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation With Applications to Tracking and Navigation. New York: Wiley, 2001.

[2] K. Birmiwal and Y. Bar-Shalom, "On tracking a maneuvering target in clutter," IEEE Trans. Aerosp. Electron. Syst., vol. AES-20, pp. 635–644, Sept. 1984.

[3] T. Cham and J. M. Rehg, "A multiple hypothesis approach to figure tracking," in Proc. Computer Vision and Pattern Recognition Conf. (CVPR'99), June 1999, pp. 239–245.

[4] B. Coifman, D. Beymer, P. McLauchlan, and J. Malik, "A real-time computer vision system for vehicle tracking and traffic surveillance," Transport. Res., pt. C, vol. 6, no. 4, pp. 271–288, 1998.

[5] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Statistic and knowledge-based moving object detection in traffic scenes," in Proc. IEEE ITSC Int. Conf. Intelligent Transportation Systems, 2000.

[6] R. Cucchiara, P. Mello, and M. Piccardi, "Image analysis and rule-based reasoning for a traffic monitoring system," IEEE Trans. Intell. Transport. Syst., vol. 1, pp. 119–130, June 2000.

[7] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, "Nonparametric kernel density estimation for visual surveillance," Proc. IEEE, vol. 90, no. 7, pp. 1151–1163, July 2002.

[8] N. Friedman and S. Russell, "Image segmentation in video sequences: a probabilistic approach," in Proc. 13th Conf. Uncertainty in Artificial Intelligence, 1997, pp. 175–181.

[9] B. Heisele, U. Kressel, and W. Ritter, "Tracking nonrigid, moving objects based on color cluster flow," in Proc. Computer Vision and Pattern Recognition Conf., 1997, pp. 257–260.


[10] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," in Proc. 2nd Eur. Workshop Advanced Video Based Surveillance Systems, Sept. 2001.

[11] D. Koller, J. Weber, and J. Malik, "Robust multiple car tracking with occlusion reasoning," in Proc. Eur. Conf. Computer Vision (1), 1994, pp. 189–196.

[12] O. Masoud, "Tracking and analysis of articulated motion with application to human motion," Ph.D. dissertation, Univ. Minnesota, Minneapolis, MN, 2000.

[13] O. Masoud, S. Rogers, and N. P. Papanikolopoulos, "Monitoring weaving sections," Tech. Rep. CTS 01-06, Oct. 2001.

[14] S. J. McKenna, S. Jabri, Z. Duric, and H. Wechsler, "Tracking interacting people," in Proc. 4th Int. Conf. Automatic Face and Gesture Recognition, 2000, pp. 348–353.

[15] F. G. Meyer and P. Bouthemy, "Region-based tracking using affine motion models in long image sequences," Comput. Vis. Graph. Image Process., vol. 60, no. 2, pp. 119–140, Sept. 1994.

[16] C. Ridder, O. Munkelt, and H. Kirchner, "Adaptive background estimation and foreground detection using Kalman filtering," in Proc. Int. Conf. Recent Advances in Mechatronics, 1995, pp. 193–199.

[17] R. Rosales and S. Sclaroff, "Improved tracking of multiple humans with trajectory prediction and occlusion modeling," Tech. Rep. 1998-007, 1998.

[18] P. L. Rosin and T. J. Ellis, "Detecting and classifying intruders in image sequences," in Proc. 2nd British Machine Vision Conf., Glasgow, U.K., 1991, pp. 293–300.

[19] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. Computer Vision and Pattern Recognition Conf. (CVPR'99), June 1999.

[20] H. Veeraraghavan, O. Masoud, and N. P. Papanikolopoulos, "Vision-based monitoring of intersections," in Proc. IEEE Conf. Intelligent Transportation Systems, 2002.

[21] G. Welch and G. Bishop, "An introduction to the Kalman filter (tutorial)," in Proc. SIGGRAPH, 2001.

[22] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 780–785, July 1997.

Harini Veeraraghavan received the B.Tech. degree in electrical engineering from the Regional Engineering College, Kurukshetra, India, and the M.S. degree in computer science from the University of Minnesota, Minneapolis, MN, in 1999 and 2003, respectively. She is currently working toward the Ph.D. degree in computer science at the University of Minnesota.

Her research interests include vision-based tracking, Kalman filter based estimation, and jump linear systems.

Osama Masoud received the B.S. and M.S. degrees in computer science from King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia, in 1992 and 1994, respectively, and the Ph.D. degree in computer science from the University of Minnesota, Minneapolis, MN, in 2000.

Previously, he was a Postdoctoral Associate at the Department of Computer Science and Engineering at the University of Minnesota and served as the Director of Research and Development at Point Cloud Incorporated, Plymouth, MN. He is currently a Research Associate at the Department of Computer Science and Engineering at the University of Minnesota. His research interests include computer vision, robotics, transportation applications, and computer graphics.

Dr. Masoud received a Research Contribution Award from the University of Minnesota, the Rosemount Instrumentation Award from Rosemount Incorporated, and the Matt Huber Award for Excellence in Transportation Research. One of his papers (coauthored with N. P. Papanikolopoulos) was awarded the IEEE VTS 2001 Best Land Transportation Paper Award.

Nikolaos P. Papanikolopoulos (S'88–M'92–SM'01) was born in Piraeus, Greece, in 1964. He received the Diploma degree in electrical and computer engineering from the National Technical University of Athens, Athens, Greece, in 1987, and the M.S.E.E. degree in electrical engineering and the Ph.D. degree in electrical and computer engineering from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1988 and 1992, respectively.

Currently, he is a Professor in the Department of Computer Science at the University of Minnesota and Director of the Center for Distributed Robotics. His research interests include robotics, computer vision, sensors for transportation applications, control, and intelligent systems. He has authored or coauthored more than 165 journal and conference papers in the above areas (40 refereed journal papers).

Dr. Papanikolopoulos was a finalist for the Anton Philips Award for Best Student Paper at the 1991 IEEE Int. Conf. on Robotics and Automation, the recipient of the Best Video Award at the 2000 IEEE Int. Conf. on Robotics and Automation, and the recipient of the Kritski Fellowship in 1986 and 1987. He was a McKnight Land-Grant Professor at the University of Minnesota for the period 1995–1997 and has received the NSF Research Initiation and Early Career Development Awards. He was also awarded the Faculty Creativity Award from the University of Minnesota. One of his papers (coauthored with O. Masoud) was awarded the IEEE VTS 2001 Best Land Transportation Paper Award. Finally, he has received grants from DARPA, Sandia National Laboratories, NSF, Microsoft, INEEL, USDOT, MN/DOT, Honeywell, and 3M.