
Source: cs.brown.edu/people/mjblack/Papers/Sigal_IWCM_color.pdf (Sigal_IWCM_color.dvi, created 12/24/2004 8:53:32 PM)

Tracking Complex Objects using

Graphical Object Models

Leonid Sigal1, Ying Zhu3, Dorin Comaniciu2 and Michael Black1

1 Department of Computer Science, Brown University, Providence, RI 02912. {ls, black}@cs.brown.edu

2 Integrated Data Systems, Siemens Corporate Research, Princeton, NJ 08540. [email protected]

3 Real Time Vision & Modeling, Siemens Corporate Research, Princeton, NJ 08540. [email protected]

Abstract. We present a probabilistic framework for component-based automatic detection and tracking of objects in video. We represent objects as spatio-temporal two-layer graphical models, where each node corresponds to an object or component of an object at a given time, and the edges correspond to learned spatial and temporal constraints. Object detection and tracking is formulated as inference over a directed loopy graph, and is solved with non-parametric belief propagation. This type of object model allows object detection to make use of temporal consistency (over an arbitrarily sized temporal window), and facilitates robust tracking of the object. The two-layer structure of the graphical model allows inference over the entire object as well as individual components. AdaBoost detectors are used to define the likelihood and form proposal distributions for components. Proposal distributions provide 'bottom-up' information that is incorporated into the inference process, enabling automatic object detection and tracking. We illustrate our method by detecting and tracking two classes of objects, vehicles and pedestrians, in video sequences collected using a single grayscale uncalibrated car-mounted moving camera.

1 Introduction

The detection and tracking of complex objects in natural scenes requires rich models of object appearance that can cope with variability among instances of the object and across changing viewing and lighting conditions. Traditional optical flow methods are often ineffective for tracking objects because they are memoryless; that is, they lack any explicit model of object appearance. Here we seek a model of object appearance that is rich enough for both detection and tracking of objects such as people or vehicles in complex scenes. To that end we develop a probabilistic framework for automatic component-based detection and tracking. By combining object detection with tracking in a unified framework we can achieve a more robust solution to both problems. Tracking can make use of object detection for initialization and re-initialization during transient failures or occlusions, while object detection can be made more reliable by considering the consistency of the detection over time. Modeling objects as an arrangement of image-based (possibly overlapping) components facilitates detection of complex articulated objects, and helps in handling partial object occlusions or local illumination changes.

Object detection and tracking is formulated as inference in a two-layer graphical model in which the coarse-layer node represents the whole object and the fine-layer nodes represent multiple component "parts" of the object. Directed edges between nodes represent learned spatial and temporal probabilistic constraints. Each node in the graphical model corresponds to a position and scale of the component, or the object as a whole, in an image at a given time instant. Each node also has an associated AdaBoost detector that is used to define the local image likelihood and a proposal process. In general the likelihoods and dependencies are not Gaussian. To infer the 2D position and scale at each node we exploit a form of non-parametric belief propagation (BP) that uses a variation of particle filtering and can be applied over a loopy graph [8, 15].

The problem of describing and recognizing categories of objects (e.g. faces, people, cars) is central to computer vision. It is common to represent objects as collections of features with distinctive appearance, spatial extent, and position [2, 6, 10, 11, 16, 17]. There is, however, large variation in how many features one must use and how these features are detected and represented. Most algorithms rely on semi-supervised learning schemes [11, 16, 17] in which examples of the desired class of objects must be manually aligned, and learning algorithms are then used to automatically select the features that best separate images of the desired class from background image patches. More recent approaches learn the model in an unsupervised fashion from a set of unlabeled and unsegmented images [2, 6]. In particular, Fergus et al. [6] develop a component-based object detection algorithm that learns an explicit spatial relationship between parts of an object, but unlike our framework it assumes Gaussian likelihoods and spatial relationships and does not model temporal consistency.

In contrast to part-based representations, simple discriminative classifiers treat an object as a single image region. Boosted classifiers [16], for example, while very successful, tend to produce a large set of false positives. While this problem can be reduced by incorporating temporal information [17], discriminative classifiers based on boosting do not explicitly model parts or components of objects. Such part-based models are useful in the presence of partial occlusions, out-of-plane rotation and/or local lighting variations [5, 11, 18]. Part- or component-based detection is also capable of handling highly articulated objects [10], for which a single appearance-model classifier may be hard to learn. An illustration of the usefulness of component-based detection for vehicles is shown in Fig. 1. While all vehicles have almost identical parts (tires, bumper, hood, etc.), their placement can vary significantly due to large variability in the height and type of vehicles.

Murphy et al. [12] also use graphical models in a patch-based detection scheme. Unlike our approach, they do not incorporate temporal information or explicitly reason about the object as a whole. Also closely related is the work of [13], which uses AdaBoost for multi-target tracking and detection. However, their Boosted Particle Filter [13] does not integrate component-based object detection and is limited to temporal propagation in only one direction (forward in time). In contrast to these previous approaches, we combine techniques from discriminative learning, graphical models, belief propagation, and particle filtering to achieve reliable multi-component object detection and tracking.

Fig. 1. Variation in the vehicle class of objects. While the objects shown here have drastically different appearance as a whole, due to the varying height and type of the vehicle, their components tend to be very homogeneous and are easy to model.

In our framework, object motion is represented via temporal constraints (edges) in the graphical model. These model-based constraints for the object and components are learned explicitly from labeled data, and make no use of optical flow information. However, the model could be extended to use explicit flow information as part of the likelihood model, or as part of the proposal process. In particular, as part of the proposal process, optical flow information could be useful in focusing the search on regions with "interesting" motion that are likely to correspond to an object or a part/component of an object.

2 Graphical Object Models

Following the framework of [14] we model an object as a spatio-temporal directed graphical model. Each node in the graph represents either the object or a component of the object at time t. Nodes have an associated state vector X = (x, y, s) defining the component's real-valued position and scale within an image. The joint probability distribution for this spatio-temporal graphical object model with N components and over T frames can be written as:

P(X_0^O, X_0^{C_1}, ..., X_0^{C_N}, ..., X_T^O, X_T^{C_1}, ..., X_T^{C_N}, Y_0, Y_1, ..., Y_T) =
(1/Z) ∏_{ij} ψ_ij(X_i^O, X_j^O) ∏_{ik} ψ_ik(X_i^O, X_i^{C_k}) ∏_{ikl} ψ_kl(X_i^{C_k}, X_i^{C_l}) ∏_i φ_i(Y_i, X_i^O) ∏_{ik} φ_i(Y_i, X_i^{C_k})

where X_t^O and X_t^{C_n} are the states of the object, O, and of the object's n-th component, C_n, at time t respectively (n ∈ (1, N) and t ∈ (1, T)); ψ_ij(X_i^O, X_j^O) is the temporal compatibility of the object state between frames i and j; ψ_ik(X_i^O, X_i^{C_k}) is the spatial compatibility of the object and its components at frame i; ψ_kl(X_i^{C_k}, X_i^{C_l}) is the spatial compatibility between object components at frame i; and φ_i(Y_i, X_i^O) and φ_i(Y_i, X_i^{C_k}) denote the local evidence (likelihood) for the object and component states respectively.
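To make the factorization concrete, the two-layer spatio-temporal graph behind this joint distribution can be sketched as an edge list. This is an illustrative sketch only: the node and edge-type names are ours, and the component-component edges of the ψ_kl terms are omitted for brevity.

```python
from itertools import product

def build_graph(n_components, window):
    """Two-layer spatio-temporal graph as a list of directed edges.
    Nodes are ('O', t) for the object and ('C', k, t) for component k
    at frame t. 'temporal' edges link object nodes across frames
    (the psi_ij terms); 'object-component' edges link the object to
    its components within a frame (the psi_ik terms)."""
    edges = []
    for t in range(window - 1):
        edges.append((('O', t), ('O', t + 1), 'temporal'))
    for t, k in product(range(window), range(n_components)):
        edges.append((('O', t), ('C', k, t), 'object-component'))
    return edges

# A 3-frame window with four components, as used in the experiments.
edges = build_graph(n_components=4, window=3)
```

With w = 3 frames and N = 4 components this yields 2 temporal edges and 12 object-component edges.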

Our framework can be viewed as having five distinct components: (i) a graphical model, (ii) an inference algorithm that infers a probability distribution over the state variables at each node in the graph, (iii) a local evidence distribution (or image likelihood), (iv) a proposal process for some or all nodes in the graphical model, and (v) a set of spatial and/or temporal constraints corresponding to the edges in the graph. We will now discuss each of these in turn.

2.1 Building the Graphical Model

For a single frame we represent objects using a two-layer spatial graphical model. The fine, component, layer contains a set of loosely connected "parts." The coarse, object, layer corresponds to an entire appearance model of the object and is connected to all constituent components. Examples of such models for pedestrian and vehicle detection are shown in the shaded regions of Fig. 2a and 2b respectively. In both cases objects are modeled using four overlapping image components. For the vehicle the components are the top-left (TL), top-right (TR), bottom-right (BR) and bottom-left (BL) corners; for the pedestrian, they are the head (HD), left arm (LA), right arm (RA) and legs (LG) (see Fig. 3ab).

To integrate temporal constraints we extend the spatial graphical models over time to an arbitrary-length temporal window. The resulting spatio-temporal graphical models are shown in Fig. 2a and 2b. Having a two-layer graphical model, unlike the single component-layer model of [14], allows the inference process to reason explicitly about the object as a whole, and also helps reduce the complexity of the graphical model by allowing components to be assumed independent over time conditioned on the overall object appearance. Alternatively, one can imagine building a single object-layer model, which would be similar to the Boosted Particle Filter [13] (with bi-directional temporal constraints).

2.2 Learning Spatial and Temporal Constraints

Each directed edge between components i and j has an associated potential function ψ_ij(X_i, X_j) that encodes the compatibility between pairs of node states. The potential ψ_ij(X_i, X_j) is modeled using a mixture of M_ij Gaussians (following [14]):

ψ_ij(X_i, X_j) = λ_0 N(X_j; µ_ij, Λ_ij) + (1 − λ_0) ∑_{m=1}^{M_ij} π_ijm N(X_j; F_ijm(X_i), G_ijm(X_i))

where λ_0 is a fixed outlier probability, µ_ij and Λ_ij are the mean and covariance of the Gaussian outlier process, and F_ijm(X_i) and G_ijm(X_i) are functions that return the mean and covariance matrix respectively of the m-th Gaussian mixture component. π_ijm is the relative weight of an individual component and ∑_{m=1}^{M_ij} π_ijm = 1. For the experiments in this paper we used M_ij = 2 mixture components.

Fig. 2. Graphical models for (a) pedestrian and (b) vehicle detection and tracking. Spatio-temporal models are obtained by replicating a spatial model (shown by the shaded region) along the temporal domain to a w-length window and then connecting the object layer nodes across time.

Given a set of labeled images, where each component is associated with a single reference point, we use the standard iterative Expectation-Maximization (EM) algorithm with K-means initialization to learn F_ijm(X_i) of the form:

F_ijm(X_i) = X_i + [µ_ijm^x, µ_ijm^y, µ_ijm^s]^T    (1)

where µ_ijm^x, µ_ijm^y, µ_ijm^s are the mean position and scale of component or object j relative to i. G_ijm(X_i) is assumed to be a diagonal matrix, representing the variance in relative position and scale. Examples of the learned conditional distributions can be seen in Fig. 3(c,d,e).
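As an illustration of how a learned potential of this form is evaluated, the sketch below computes ψ_ij for a 3-dimensional state (x, y, s). The diagonal covariances everywhere and the parameter layout are simplifying assumptions; the conditional mean follows Eq. (1), F_ijm(X_i) = X_i + µ_ijm.

```python
import numpy as np

def gauss(x, mean, cov_diag):
    """Diagonal-covariance Gaussian density at x (state = (x, y, s))."""
    d = x - mean
    return np.exp(-0.5 * np.sum(d * d / cov_diag)) / \
        np.sqrt((2 * np.pi) ** len(x) * np.prod(cov_diag))

def psi(x_i, x_j, mixture, outlier):
    """Evaluate psi_ij(X_i, X_j): a fixed-weight Gaussian outlier
    process plus a mixture of Gaussians whose means are predicted
    from X_i via Eq. (1). `mixture` is a list of (pi_m, mu_m, cov_m)
    with sum(pi_m) = 1; `outlier` is (lam0, mu, Lam)."""
    lam0, mu0, cov0 = outlier
    p = lam0 * gauss(x_j, mu0, cov0)
    for pi_m, mu_m, cov_m in mixture:
        p += (1 - lam0) * pi_m * gauss(x_j, x_i + mu_m, cov_m)
    return p
```

States near a predicted mean X_i + µ_ijm score much higher than states explained only by the broad outlier process.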

2.3 AdaBoost Image Likelihoods

The likelihood φ(Y, X_i) models the probability of observing the image Y conditioned on the state X_i of node i, and ideally should be robust to partial occlusions and the variability of image statistics across many different inputs. To that end we build our likelihood model using a boosted classifier.

Following [16] we train boosted detectors for each component. For simplicity we use AdaBoost [16] without a cascade (training with a cascade would likely improve the computational efficiency of the system). To reduce the number of false positives produced by the detectors, we use a bootstrap procedure that iteratively adds false positives, collected by running the trained strong classifier over a set of background images (not containing the desired object), and then re-trains the detectors using the old positive and the newly extended negative sets.


Fig. 3. Components for the (a) pedestrian and (b) vehicle object models (entire appearance model in cyan), and learned conditional distributions from (c) the Bottom-Left (BL) to the Top-Left (TL) component, (d) the Bottom-Left (BL) component to the whole appearance model, and (e) the whole appearance model to the Bottom-Left (BL) component.

Given a set of labeled patterns, the AdaBoost procedure learns a weighted combination of base weak classifiers, h(I) = ∑_{k=1}^{K} α_k h_k(I), where I is an image pattern, h_k(I) is the weak classifier chosen for round k of boosting, and α_k is the corresponding weight. We use a weak classifier scheme similar to the one discussed in [16]: h_k(I) = p_k [(f_k(I))^{β_k} < θ_k], where f_k(I) is a feature of the pattern I computed by convolving I with the delta function over the extent of a spatial template; θ_k is a threshold; p_k is the polarity indicating the direction of the inequality; and β_k ∈ [1, 2] allows for a symmetric two-sided pulse classification.

The output of the AdaBoost classifier is a confidence h(I) that the given pattern I is of the desired class. It is customary to consider an object present if h(I) ≥ (1/2) ∑_{k=1}^{K} α_k. We convert this confidence into a likelihood function by first normalizing the α_k's, so that h(I) ∈ [0, 1], and then exponentiating:

φ(Y, X_i) ∝ exp(h(I)/T)    (2)

where the image pattern I is obtained by cropping the full image Y based on the state of the object or component X_i, and T is an artificial temperature parameter that controls the smoothness of the likelihood function, with smaller values of T leading to a peakier distribution. Consequently we can also anneal the likelihood by deriving a schedule with which T changes. We found an exponential annealing schedule T = T_0 υ^κ, where T_0 is the initial temperature, υ is a fraction ∈ (0, 1), and κ is the annealing iteration, to work well in practice. AdaBoost classifiers are learned using a database of 861 vehicles and 662 pedestrians [11]. The number of negative examples after bootstrapping tends to be on the order of 2000 to 3000.
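A minimal sketch of this likelihood construction, assuming precomputed feature responses f_k(I) and a one-sided weak classifier p_k f_k(I) < p_k θ_k (a simplification of the two-sided β_k variant above):

```python
import numpy as np

def boosted_confidence(features, alphas, thresholds, polarities):
    """Normalized strong-classifier output h(I) in [0, 1].
    Each weak vote is 1 when p_k * f_k < p_k * theta_k; the votes
    are combined with weights alpha_k and normalized by sum(alpha)."""
    alphas = np.asarray(alphas, dtype=float)
    votes = (np.asarray(polarities) * np.asarray(features)
             < np.asarray(polarities) * np.asarray(thresholds))
    return float(np.dot(alphas, votes) / alphas.sum())

def likelihood(h, T0=0.2, upsilon=0.8, k=0):
    """phi proportional to exp(h(I)/T) with exponential annealing
    T = T0 * upsilon**k; smaller T makes the likelihood peakier."""
    T = T0 * upsilon ** k
    return np.exp(h / T)
```

Note how annealing sharpens the likelihood: the ratio between a confident and an unconfident pattern grows with each annealing iteration k.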

Depending on the object, one may or may not have a likelihood or a proposal process for the object layer nodes. For example, if the whole appearance of an object is too complicated to model as a whole (e.g. vehicles of arbitrary size) and can only be modeled in terms of components, we can simply assume a uniform likelihood over the entire state space. In such cases the object layer nodes simply fuse the component information to produce estimates for the object state that are consistent over time.

It is worth noting that the assumption of local evidence independence implicit in our graphical model is only approximate, and may be violated in regions where the object and components overlap. In such cases the correlation or bias introduced into the inference process will depend on the nature of the filters chosen by the boosting procedure. While this approximation works well in practice, we plan to study it more formally in the future.

2.4 Non-parametric BP

Inferring the state of the object and its components in our framework is defined as estimating belief in a graphical model. We use a form of non-parametric belief propagation, Pampas [8], for this task. The approach is a generalization of particle filtering [4] which allows inference over arbitrary graphs rather than a simple chain. In this generalization the 'message' used in standard belief propagation is approximated with a kernel density (formed by propagating a particle set through a mixture of Gaussians density), and the conditional distribution used in standard particle filtering is replaced by a product of incoming messages. Most of the computational complexity lies in sampling from a product of kernel densities, required for message passing and belief estimation; we use efficient sequential multi-scale Gibbs sampling and epsilon-exact sampling [7] to address this problem.

Individual messages may not constrain a node well; however, the product over all incoming messages into the node tends to produce a very tight distribution in the state space. For example, any given component of a vehicle is incapable of estimating the height of the vehicle reliably, but once we integrate information from all components in the object layer node, we can get a very reliable estimate of the overall object size.

More formally, a message m_ij from node i to node j is written as

m_ij(X_j) = ∫ ψ_ij(X_i, X_j) φ_i(Y_i, X_i) ∏_{k∈A_i/j} m_ki(X_i) dX_i,    (3)

where A_i/j is the set of neighbors of node i excluding node j, φ_i(Y_i, X_i) is the local evidence (or likelihood) associated with node i, and ψ_ij(X_i, X_j) is the potential designating the compatibility between the states of nodes i and j. For details of how the message updates can be carried out by stratified sampling from the belief and proposal functions, see [14].
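A Monte Carlo version of this update can be sketched as follows. The function names and the simple multinomial resampling are ours (not the stratified sampler of [14]): particles at node i are reweighted by the local likelihood and the product of incoming messages, then propagated through the pairwise conditional to form the outgoing kernel centers.

```python
import numpy as np

rng = np.random.default_rng(0)

def message_update(particles_i, weights_i, sample_pairwise, likelihood_i,
                   incoming_msgs, n_out=30):
    """Approximate Eq. (3): weight node i's particles by phi_i and the
    product of incoming messages m_ki (excluding the target node),
    resample, then draw X_j ~ psi_ij(X_i, .) for each survivor.
    `incoming_msgs` are callables m(x); `sample_pairwise(x)` samples
    from the pairwise conditional."""
    w = np.array([weights_i[n] * likelihood_i(p) *
                  np.prod([m(p) for m in incoming_msgs])
                  for n, p in enumerate(particles_i)])
    w /= w.sum()
    idx = rng.choice(len(particles_i), size=n_out, p=w)
    return [sample_pairwise(particles_i[n]) for n in idx]
```

In the full algorithm the returned samples would be smoothed into a kernel density; here they simply stand in for the outgoing message's particle set.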

While it is possible, and perhaps beneficial, to perform inference over the spatio-temporal model defined for the entire image sequence, there are many applications for which this is impractical due to the lengthy off-line processing required. Hence, we use a w-frame windowed smoothing algorithm, where w is an odd integer ≥ 1 (see Fig. 2). There are two ways one can do windowed smoothing: in an object-detection-centric way or a tracking-centric way. In the former we re-initialize all nodes every time we shift the window, hence temporal integration is only applied within the window of size w. In the tracking-centric way we only initialize the nodes associated with the new frame, which tends to enforce temporal consistency from before t − (w−1)/2. While the latter tends to converge faster and produce more consistent results over time, it is also less sensitive to objects entering and leaving the scene. Note that with w = 1, the algorithm resembles single-frame component-based fusion [18].
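The two schemes differ only in which nodes are (re)initialized when the window shifts; a small sketch (with illustrative names) makes the difference explicit:

```python
def slide_window(frames, w, tracking_centric=True):
    """Enumerate w-frame windows over a sequence and report which frame
    nodes are (re)initialized at each shift: every frame in the window
    for the detection-centric scheme, only the newly entered frame for
    the tracking-centric scheme (after the first window)."""
    windows = []
    for start in range(len(frames) - w + 1):
        frame_ids = frames[start:start + w]
        if tracking_centric and start > 0:
            init = [frame_ids[-1]]          # only the new frame
        else:
            init = list(frame_ids)          # re-initialize everything
        windows.append((frame_ids, init))
    return windows
```

In the tracking-centric case the beliefs carried over from earlier windows are what enforce consistency with frames before the current window.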

2.5 Proposal Process

To reliably detect and track the object, non-parametric BP makes use of a bottom-up proposal process that constantly looks for and suggests alternative hypotheses for the state of the object and components. We model a proposal distribution using a weighted particle set. To form a proposal particle set for a component, we run the corresponding AdaBoost detector over an image at a number of scales to produce a set of detection results that score above the (1/2) ∑_{k=1}^{K} α_k threshold. While this set tends to be manageable for the entire appearance model, it is usually large for non-specific component detectors (a few thousand locations can easily be found). To reduce the dimensionality we only keep the top P scoring detections, where P is on the order of 100 to 200. To achieve breadth of search we generate proposed particles by importance sampling uniformly from the detections. For more details on the use of the proposal process in the Pampas framework see [14].
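The proposal construction described above (threshold the detector, keep the top P hits, sample uniformly) can be sketched as follows; the array layout and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_proposal(detections, scores, top_p=150, n_particles=30):
    """Keep the top-P scoring detector hits and importance-sample
    particles uniformly from them. `detections` is an (M, 3) array
    of (x, y, s) states; `scores` are the detector confidences."""
    order = np.argsort(scores)[::-1][:top_p]   # best P detections
    kept = np.asarray(detections)[order]
    idx = rng.integers(0, len(kept), size=n_particles)
    return kept[idx]
```

Uniform sampling over the kept detections trades a little precision for breadth of search, which is what lets the proposal recover from tracking failures.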

3 Experiments

Tests were performed using a set of images collected with a single car-mounted grayscale camera. The result of vehicle detection and tracking over a sequence of 55 consecutive frames can be seen in Fig. 5. A 3-frame spatio-temporal object model was used and was shifted in a tracking-centric way over time. We run BP with 30 particles for 10 iterations at every frame. For comparison we implemented a simple fusion scheme that averages the best detection result from each of the four components (Fig. 5(b) 'Best Avg.') to produce an estimate for the vehicle position and scale independently at every frame. The performance of the simple fusion detection is very poor, suggesting that the noisy component detectors often do not have their global maximum at the correct position and scale. In contrast, the spatio-temporal object model consistently combines the evidence for accurate estimates throughout the sequence.
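For reference, the 'Best Avg.' baseline is nothing more than a per-frame average of each component's single best detection; a sketch, assuming (x, y, s) states:

```python
import numpy as np

def best_avg_fusion(component_best):
    """'Best Avg.' baseline: average the single best (x, y, s)
    detection from each component into one object estimate,
    independently at every frame."""
    return np.mean(np.asarray(component_best, dtype=float), axis=0)
```

Because the average is taken per frame with no spatial or temporal constraints, one bad component maximum is enough to drag the estimate off the vehicle, which is the failure mode visible in Fig. 5(b).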

The performance of the pedestrian spatio-temporal detector is shown in Fig. 6. A 3-frame spatio-temporal object model is run at a single instance in time for two pedestrians in two different scenes. As in the vehicle detection, we run BP with 30 particles for 10 iterations. For both experiments the temperature of the likelihood is set to T_0 = 0.2.

While in general the algorithm presented here is capable of detecting multiple targets, by converging to multi-modal distributions for components and objects, in practice this tends to be quite difficult and requires many particles. Particle filters in general have been shown to have difficulties when tracking multi-modal distributions [13]. The Pampas framework used here is an extension of particle filtering, and the message update involves taking a product over particle sets; consequently, Pampas suffers from similar problems. Furthermore, belief propagation over a loopy graph such as ours may further hinder the modeling of multi-modal distributions. To enable multi-target tracking we therefore employ a peak suppression scheme, where modes are detected one at a time, and the response of the likelihood function is then suppressed in the regions where peaks have already been found. An example, obtained by running a purely spatial graphical model over an image containing 6 vehicles, is shown in Fig. 4.
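The peak-suppression idea can be sketched on a discretized likelihood map; the hard zeroing and the suppression radius are assumptions for illustration:

```python
import numpy as np

def suppress_peaks(likelihood_map, n_targets, radius=2):
    """Greedy multi-target extraction: repeatedly take the global peak
    of a discretized likelihood map, then zero out a neighborhood
    around it so the next iteration locks onto a different target."""
    lik = np.array(likelihood_map, dtype=float)
    peaks = []
    for _ in range(n_targets):
        y, x = np.unravel_index(np.argmax(lik), lik.shape)
        peaks.append((y, x))
        lik[max(0, y - radius):y + radius + 1,
            max(0, x - radius):x + radius + 1] = 0.0   # suppress region
    return peaks
```

In the actual system the suppression is applied to the likelihood function itself rather than a grid, but the greedy one-at-a-time structure is the same.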

4 Conclusion

In this paper we present a novel object detection and tracking framework exploiting boosted classifiers and non-parametric belief propagation. The approach provides component-based detection and integrates temporal information over an arbitrary-size temporal window. We illustrate the performance of the framework with two classes of objects: vehicles and pedestrians. In both cases we can reliably infer the position and scale of the objects and their components. Further work needs to be done to evaluate how the method copes with changing lighting and occlusion. Additional work is necessary to develop a multi-target scheme that incorporates a probabilistic model of the entire image.

The algorithm developed here is quite general and might be applied to other object tracking and motion estimation problems. For example, we might formulate a parameterized model of facial motion in which the optical flow in different image regions (mouth, eyes, eyebrows) is modeled independently. The motion parameters for these regions could then be coupled via the graphical model and combined with a top-level head tracker. Such an approach might offer improved robustness over previous methods for modeling face motion [1].

References

1. M. J. Black and Y. Yacoob. Recognizing facial expressions in image sequences using local parameterized models of image motion, International Journal of Computer Vision, 25(1), pp. 23–48, 1997.

2. M. Burl, M. Weber and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry, ECCV, pp. 628–641, 1998.

3. J. Coughlan and S. Ferreira. Finding deformable shapes using loopy belief propagation, ECCV, Vol. 3, pp. 453–468, 2002.

4. A. Doucet, N. de Freitas and N. Gordon. Sequential Monte Carlo methods in practice, Statistics for Engineering and Information Sciences, pp. 3–14, Springer Verlag, 2001.

5. P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures, CVPR, Vol. 2, pp. 66–73, 2000.

6. R. Fergus, P. Perona and A. Zisserman. Object class recognition by unsupervised scale-invariant learning, CVPR, 2003.

7. A. Ihler, E. Sudderth, W. Freeman and A. Willsky. Efficient multiscale sampling from products of Gaussian mixtures, Advances in Neural Info. Proc. Sys. 16, 2003.

8. M. Isard. Pampas: Real-valued graphical models for computer vision, CVPR, Vol. 1, pp. 613–620, 2003.

9. M. Jordan, T. Sejnowski and T. Poggio. Graphical models: Foundations of neural computation, MIT Press, 2001.

10. K. Mikolajczyk, C. Schmid and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors, ECCV, 2004.

11. A. Mohan, C. Papageorgiou and T. Poggio. Example-based object detection in images by components, IEEE PAMI, 23(4):349–361, 2001.

12. K. Murphy, A. Torralba and W. Freeman. Using the forest to see the trees: A graphical model relating features, objects, and scenes, Advances in Neural Info. Proc. Sys. 16, 2003.

13. K. Okuma, A. Taleghani, N. de Freitas, J. Little and D. Lowe. A boosted particle filter: Multitarget detection and tracking, ECCV, 2004.

14. L. Sigal, S. Bhatia, S. Roth, M. Black and M. Isard. Tracking loose-limbed people, CVPR, 2004.

15. E. Sudderth, A. Ihler, W. Freeman and A. Willsky. Nonparametric belief propagation, CVPR, Vol. 1, pp. 605–612, 2003 (see also MIT AI Lab Memo 2002-020).

16. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features, CVPR, 2001.

17. P. Viola, M. Jones and D. Snow. Detecting pedestrians using patterns of motion and appearance, ICCV, pp. 734–741, 2003.

18. B. Xie, D. Comaniciu, V. Ramesh, M. Simon and T. Boult. Component fusion for face detection in the presence of heteroscedastic noise, Annual Conf. of the German Society for Pattern Recognition (DAGM'03), pp. 434–441, 2003.

Fig. 4. Vehicle component-based spatio-temporal object detection for multiple targets. The algorithm was queried for 8 targets. The mean of 30 samples from the belief for each object is shown in red. Targets are found one at a time using an iterative approach that adjusts the likelihood functions to down-weight regions where targets have already been found.


(a) (TL) (TR) (BR) (BL)

(b) (Best Avg) (Components) (Object)

Fig. 5. Vehicle component-based spatio-temporal object detection and tracking. (a) shows samples from the initialization/proposal distribution, and (b) shows 30 samples taken from the belief for each of the four components (middle) and the object (right). Detection and tracking were conducted using a 3-frame smoothing window. Frames 2 through 52 are shown (top to bottom) at 10-frame intervals. For comparison, (b) (left) shows the performance of a very simple fusion algorithm that fuses the best result from each of the components by averaging.


(a) (HD) (LUB) (RUB) (LB) (Object); (Components) (Object)

(b) (HD) (LUB) (RUB) (LB) (Object); (Components) (Object)

Fig. 6. Pedestrian component-based spatio-temporal object detection for two subjects, (a) and (b). For each subject, the top row shows the initialization/proposal distribution, and the bottom row shows 30 samples taken from the belief for each of the four components and the object. Detection was conducted using a 3-frame temporal window.