

1810 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 12, DECEMBER 2011

Online Distance Metric Learning for Object Tracking

Grigorios Tsagkatakis, Student Member, IEEE, and Andreas Savakis, Senior Member, IEEE

Abstract—Tracking an object without any prior information regarding its appearance is a challenging problem. Modern tracking algorithms treat tracking as a binary classification problem between the object class and the background class. The binary classifier can be learned offline, if a specific object model is available, or online, if there is no prior information about the object's appearance. In this paper, we propose the use of online distance metric learning in combination with nearest neighbor classification for object tracking. We assume that the previous appearances of the object and the background are clustered so that a nearest neighbor classifier can be used to distinguish between the new appearance of the object and the appearance of the background. In order to support the classification, we employ a distance metric learning (DML) algorithm that learns to separate the object from the background. We utilize the first few frames to build an initial model of the object and the background and subsequently update the model at every frame during the course of tracking, so that changes in the appearance of the object and the background are incorporated into the model. Furthermore, instead of using only the previous frame as the object's model, we utilize a collection of previous appearances encoded in a template library to estimate the similarity under variations in appearance. In addition to the utilization of the online DML algorithm for learning the object/background model, we propose a novel feature representation of image patches. This representation is based on the extraction of scale invariant features over a regular grid coupled with dimensionality reduction using random projections. This type of representation is both robust, capitalizing on the reproducibility of the scale invariant features, and fast, performing the tracking on a reduced dimensional space. The proposed tracking algorithm was tested under challenging conditions and achieved state-of-the-art performance.

Index Terms—Distance metric learning, nearest neighbor classification, object tracking, online learning, random projections.

I. Introduction

OBJECT TRACKING is a vital part of many computer vision applications, including surveillance, human computer interaction, smart spaces, and gaming, among others. Object tracking is a very challenging task due to appearance variations caused by occlusions and changes in illumination and pose. This task can become even more demanding when there is no prior information regarding the object's appearance. Traditional object tracking algorithms model the appearance of the object in a generative way, while modern techniques view tracking as a classification problem with temporal priors. In the latter context, the goal is to locate an image region that belongs to the same class as the tracked object, under the constraint that the new location is spatially adjacent to the previous one. This approach has received a lot of attention due to its ability to maintain accurate tracking despite severe conditions.

Manuscript received July 27, 2010; revised November 15, 2010; accepted December 19, 2010. Date of publication March 28, 2011; date of current version December 7, 2011. This work was supported in part by Eastman Kodak Company and the Center for Emerging and Innovative Sciences, a NYSTAR-Designated Center for Advanced Technology in New York State. This paper was recommended by Associate Editor F. G. B. De Natale.

The authors are with the Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY 14623 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2011.2133970

Based on the binary classification paradigm, a discriminative tracking algorithm can be trained either offline or online. If an object model is not available, modern online learning approaches for discriminative tracking utilize the boosting framework to incrementally train a binary classifier that will be able to separate the object from the background. These methods have a solid theoretical background and have demonstrated very promising results. In a similar spirit to the online trained discriminative tracking approaches, we propose a novel tracking method where the problem is treated as nearest neighbor classification with an online learned distance. The idea is based on the stipulation that different appearances of the object will be close to each other in a suitable feature space, i.e., the feature representation of the object's appearance in the current frame will be close to its feature representation in the previous frame. However, measuring similarity with the Euclidean distance, as is typically done in nearest neighbor classification, is not adequate, since the Euclidean distance does not encode any discriminative information. As a result, tracking is likely to fail due to misclassification under mild changes [3]. To overcome this impediment, we propose to learn an appropriate distance metric that will yield small distances for different appearances of the object and large distances for appearances of non-objects (background). This approach is formally known as distance metric learning (DML), and its goal is to discover a distance metric that satisfies the constraints imposed by class labels, i.e., keeps data points from the same class close to each other and maps data points from different classes far apart.

In this paper, we advocate that the utilization of an online learned distance can provide increased robustness compared to using a predefined or a pre-learned metric. Object localization is facilitated by searching for the region that minimizes the learned distance between reference templates and the current appearance of the object. The learned distance is updated online in order to incorporate various appearances of the object as well as the background regions. In addition, instead of a single reference template, a collection of templates called the template library is used to capture the object's appearance at various stages of the tracking.

1051-8215/$26.00 © 2011 IEEE

The use of nearest neighbor classification in combination with online distance metric learning offers several advantages in the context of object tracking.

1) It provides continuous values, rather than a binary output as is done in typical binary classifiers. The benefit of having continuous values for similarity is that they can be used as robust indicators of the changes in the appearance and assist in the detection of occlusions. Furthermore, the learned similarity can be used for estimating the correlation between the current appearance of the object and the examples in the template library, whereas in binary classification all previous templates are equally weighted.

2) It offers an efficient way to transition from completely supervised, to semi-supervised and unsupervised settings, depending on the availability of prior information. For example, if an initial estimation of the object's appearance is available, a batch DML algorithm can be applied and used for tracking. On the other hand, when no prior information is available, as is the case in unsupervised tracking, the first few frames can be used for initial modeling and online metric learning will handle the subsequent changes.

3) Similarly to other detection based tracking algorithms, this method has the capability to learn both positive and negative examples, which offers increased discriminative power compared to generative approaches that model the object's appearance without any concern about the background.

4) There are numerous types of features that can be used for appearance encoding, ranging from edge responses to scale invariant feature transform (SIFT) features. In contrast, boosting based techniques are usually constrained to simple Haar-like features.

Although online distance metric learning is a powerful technique in the context of object tracking, the method used for the representation of the object's appearance is another critical component of the tracking algorithm. A novel approach in image representation that has gained interest is the extraction of SIFT features over a regular grid. This type of feature representation has been employed in state-of-the-art image classification schemes including [26] and [33]. The benefits of using SIFT features for object representation include robustness to small changes in appearance and illumination. However, the extraction of dense SIFT features generates a high dimensional representation of each region. Performing the required operations of the proposed tracking scheme on the high dimensional representations will severely degrade the execution speed of the system. We address this issue by utilizing the random projections (RPs) method for reducing the dimensionality of the region representation. RPs is a data-independent linear method for dimensionality reduction which offers performance acceleration without requiring prior knowledge of the object's appearance. The use of RPs in the context of object tracking was first proposed in [27], where the authors used a linear kernel and a template matching algorithm for object tracking across a network of cameras.

The rest of this paper is organized as follows. Section II provides an overview of different approaches in object tracking and their association with our work. The process of feature extraction using SIFT and RPs is discussed in Section III. Distance metric learning is overviewed in Section IV, while the application of distance learning in object tracking is presented in Section V. Experimental results are given in Section VI and this paper concludes in Section VII.

II. Prior Work

Traditional template based tracking algorithms can be coarsely divided into two categories: offline and online. In the offline approaches, a model of the object is either learned offline, using similar visual examples, or is learned during the first few frames. In both cases, once the object model is generated, a predefined metric is used to identify the locations in subsequent frames. Examples of this class of tracking algorithms include kernel based methods [1] and appearance models [6]. These methods suffer from the limitation that once the model is created, it is not updated and, as a result, tracking may fail due to unexpected object appearances. In addition, appearance based methods, such as [6], require training with all possible poses and illuminations, a time consuming process. Furthermore, the use of a predefined metric, such as the Bhattacharyya coefficient [1], the Kullback–Leibler divergence [2], the normalized cross correlation, or the sum-of-absolute differences [3], may not be appropriate for representing the relationship between different appearances of the object that may be encountered during tracking.

The second line of thought utilizes online learning to learn the object's changing appearance during tracking. Online or incremental template update was first suggested by Jepson et al. in [15], where a mixture of three components was proposed for the representation of the object, namely the stable, the transient, and the noise component. Other examples of online learning for tracking include the work of Matthews et al. [16], who suggested a template update strategy for maintaining an updated appearance model, and the online subspace learning approaches by Lim et al. [17] and Kim et al. [18], which seek to incrementally update the object's appearance subspace. Although these methods attempt to maintain an accurate model of the object, they are vulnerable to drift, i.e., the slow degradation of the object's model and the adaptation to the background's appearance. A partial solution to the problem of drift was proposed in [19], where offline trained support vector machines were used for separating the appearance of the object from that of the background. Nevertheless, this approach is still an offline based method and thus faces problems similar to those of other offline based methods.

A novel framework in visual tracking aims at alleviating the problems caused by drift without requiring an explicit model of the object. According to this framework, tracking is treated as a binary classification problem where the objective is to both model the object's appearance and separate it from the appearance of the background during tracking, instead of relying on offline training. In [20], Avidan proposed the use of the classification score of the Adaboost classifier as the object localization mechanism, while replacing old or unreliable weak classifiers to cope with changes in the appearance of the object. In [12], Liu et al. proposed a semi-supervised ensemble tracking approach, where the particle filter framework was used for both object localization and collection of unlabeled data points used to train an ensemble classifier. In [22], Grabner et al. proposed an online tracking method that employed an online Adaboost classifier to train a binary classifier as new samples were presented. The trained classifier was subsequently applied for object localization. The method was later extended in a semi-supervised setting in [13] using the SemiBoosting framework, where labeled examples were obtained from the first frame and subsequent examples were incorporated in a semi-supervised manner. In [21], Babenko et al. further extended the use of online learning for tracking by utilizing the multiple instance learning framework, where instead of using a single positive example for the online Adaboost classifier, a set of image patches was introduced to capture the appearance of the object.

The utilization of online learned binary classifiers for inferring the presence or absence of the object in a candidate window has been shown to outperform traditional offline generative approaches in various challenging scenarios. The proposed tracking algorithm is motivated by the online discriminative framework, and employs an online learning approach for modeling the object's appearance that tackles the issues present in the offline approaches. In addition, a case-specific online learned distance metric is utilized to accurately model the association between the object's appearances at different time instances. The proposed tracking algorithm is similar to online learned binary classification approaches but does not impose the hard constraint of binary decisions. We argue that this type of hard decision can lead to drift, since the binary classifier would be forced to accept and incorporate the appearance of the object even if it is not an optimal one, e.g., when the object is partially occluded or blurred due to fast motion. By utilizing a soft, distance-based classification metric, the proposed scheme can better model the changes in the object's appearance and continuously track the object, even if it is partially occluded, without adversely modifying the appearance model, which can lead to drift. In addition to overcoming the drift problem, the learned distance can be used to compare the similarity between examples in the template library and the current object's appearance, and to update the library correspondingly. Boosting-like approaches, however, can utilize only the appearance of the object in the previous frame, which limits their ability to generate a robust and enduring object model. Furthermore, unlike the restriction to simple binary features in boosting based approaches, our proposed algorithm can incorporate any form of vectorizable feature representation and thus offers a wider range of options depending on the computational constraints and the accuracy requirements.

III. Object Representation

A key decision in the design of an object tracking algorithm is the type of representation to be used. Historically, various types of low level features have been considered. Object representation via raw pixel intensities is probably the simplest approach and the most efficient one in terms of computational complexity. However, changes in illumination can drastically change the representation of an object, which may cause the tracker to fail. To resolve this problem, different approaches have been presented that encode image regions using color histograms [1], spatio-temporal appearance models [24], and part-based appearance models [25]. In this paper, we employ SIFT [23] over a regular grid to obtain the initial representation of each candidate window. Once the initial feature representation is obtained, we apply a linear dimensionality reduction transform to reduce the computational complexity of object localization and distance metric learning.

A. SIFT Descriptors on a Regular Grid

The SIFT method is particularly successful for extracting features that are invariant to small changes in the object's appearance, including scale and orientation, and has been successfully applied in recent state-of-the-art systems for image classification [27], [33]. It has also been used as an extension of traditional color histogram representations for mean shift tracking in [29].

The first stage of SIFT is keypoint localization. Keypoints correspond to maxima/minima of the difference of Gaussians (DoG) occurring at multiple scales. After some intermediate steps, such as removing keypoints with low contrast, eliminating responses along edges, and assigning the appropriate orientations, the selected region is described by a 128 dimensional vector corresponding to a histogram of oriented gradients. Even though experimental results suggest that the keypoints are stable, recent studies have shown that obtaining the SIFT points on a regular grid can outperform the keypoints obtained by the DoG [26]. In addition, the process of extracting the descriptors can be significantly accelerated by using a piecewise-flat weighting approach, rather than a Gaussian windowing function, as discussed in [30]. More specifically, instead of weighting the contribution of the extracted gradients based on a Gaussian windowing function, the gradients are all weighted equally. Once all the gradients have been accumulated into a spatial bin, the bin is reweighted by a Gaussian filter. This type of approximation incurs a minimal loss while being significantly faster. In our implementation, for each 50 × 50 pixel target window, 9 SIFT descriptors are extracted using the VLFeat feature extraction library [30], resulting in a 9 × 128 = 1152 dimensional representation of the appearance of the window. The process of extracting the 1152 dimensional vector takes about 15 ms using the VLFeat library on a desktop computer.
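As a sketch of the dimension bookkeeping only: the text does not state the grid layout, so a 3 × 3 grid over the 50 × 50 window is assumed here, and random placeholders stand in for real descriptors that would come from a dense SIFT implementation such as VLFeat.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed layout: 3 x 3 grid of descriptor centers over the 50 x 50 window,
# each position yielding one 128-dimensional SIFT descriptor.
grid_rows, grid_cols, sift_dim = 3, 3, 128
descriptors = rng.random((grid_rows * grid_cols, sift_dim))   # placeholder 9 x 128

# Concatenate the 9 descriptors into the single window feature vector.
window_feature = descriptors.reshape(-1)
assert window_feature.shape == (1152,)    # 9 * 128 = 1152, as stated in the text
```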

B. Random Projections

The high dimensional representation of each image region that is generated by the dense SIFT descriptors can significantly reduce the processing speed of the system due to the complexity of object localization and distance learning. Dimensionality reduction may be used to reduce the computational load while keeping the most important information intact. The benefit of dimensionality reduction is that the execution time of algorithms that depend on the dimensions of their input data, such as DML, can be significantly decreased.

In this paper, we propose the use of a novel approach to dimensionality reduction, called RPs. RPs is a linear technique for dimensionality reduction based on the Johnson–Lindenstrauss (JL) lemma [35]. Formally, the JL lemma states that for a finite collection of points Q in R^d with fixed 0 < ε < 1 and β > 0, when

$$k \;\ge\; \frac{4 + 2\beta}{\varepsilon^2/2 - \varepsilon^3/3}\,\ln(\#Q) \;\approx\; O\!\left(\frac{\ln \#Q}{\varepsilon^2}\right) \tag{1}$$

then a random matrix R ∈ R^{k×d}, where k ≤ d, whose elements are drawn i.i.d. from a zero mean, bounded variance distribution satisfies

$$(1 - \varepsilon)\sqrt{k/d} \;\le\; \frac{\lVert Rx - Ry \rVert_2}{\lVert x - y \rVert_2} \;\le\; (1 + \varepsilon)\sqrt{k/d} \tag{2}$$

for every pair of elements x, y ∈ Q with probability exceeding 1 − (#Q)^{−β}. An example of this type of distribution is the normal distribution, where each element r_ij of R follows r_ij ∼ N(0, 1). In addition, in [28] it was shown that the elements r_ij of R can be drawn i.i.d. from the following distribution:

$$r_{ij} \;=\; \sqrt{3}\times\begin{cases} +1, & \text{with probability } 1/6\\ 0, & \text{with probability } 2/3\\ -1, & \text{with probability } 1/6. \end{cases} \tag{3}$$

The distribution in (3) is notably efficient, since it discards 2/3 of the data that correspond to multiplications by zero. Furthermore, it can be implemented using fixed point arithmetic operations consisting of only additions and subtractions if the scaling coefficient is factored out.
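A minimal sketch of this saving (dimensions and variable names are illustrative, not from the paper): projecting with a matrix drawn from (3) only requires the positions of the ±1 entries, so each output coordinate is a sum and a difference of input entries, with the √3 scaling factored out and applied once.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5

# Draw the sign pattern of (3): +1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6.
S = rng.choice([1, 0, -1], size=(k, d), p=[1 / 6, 2 / 3, 1 / 6])
R = np.sqrt(3) * S                      # the actual RP matrix of (3)

x = rng.standard_normal(d)

# Sparse evaluation: per output row, add the entries under +1 and subtract the
# entries under -1; the sqrt(3) scaling is applied once at the end.
y_sparse = np.sqrt(3) * np.array(
    [x[S[i] == 1].sum() - x[S[i] == -1].sum() for i in range(k)]
)

assert np.allclose(y_sparse, R @ x)     # identical to the dense multiplication
```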

Compared to traditional dimensionality reduction approaches such as principal component analysis (PCA), the major benefit of RPs is universality, i.e., the same RP matrix can be used without the need for training based on data statistics, as is done in PCA. The fact that RPs do not require training is of particular importance, given that in most tracking scenarios a model of the object we wish to track may not be available beforehand. In the context of tracking, we apply the RPs method to each candidate window after the dense SIFT descriptors are extracted, thereby reducing its dimensionality from 1152 to 300 dimensions. This reduction is computationally efficient to apply, since it only requires a matrix multiplication, yet offers significant computational savings.
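A quick numerical check of the JL-style guarantee (2) at the dimensions used in the text (1152 → 300). The feature vectors below are random stand-ins for dense SIFT representations, and the entries are drawn as N(0, 1/d) — an assumed normalization convention under which the √(k/d) scaling in (2) comes out directly.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 1152, 300                          # dense-SIFT dimension -> reduced dimension

# RP matrix with zero-mean entries of variance 1/d (assumed normalization).
R = rng.standard_normal((k, d)) / np.sqrt(d)

x = rng.standard_normal(d)                # stand-ins for two feature vectors
y = rng.standard_normal(d)

ratio = np.linalg.norm(R @ x - R @ y) / np.linalg.norm(x - y)

# Per (2), the ratio concentrates around sqrt(k/d); check the relative distortion.
distortion = ratio / np.sqrt(k / d)
assert abs(distortion - 1.0) < 0.2        # epsilon-style bound, generous margin
```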

IV. Distance Metric Learning (DML)

The distance between two vectors is often measured using the Euclidean distance or the inner product. Recent research in the field of pattern classification has shown that using a more appropriate distance metric can significantly improve results. In supervised DML, the objective is to learn a new distance that will satisfy the pairwise constraints imposed by class label information. The new distance metric can be expressed as either a Mahalanobis-like distance or, equivalently, as a linear transformation of the input data. Formally, the distance between two data points x, y ∈ R^n is given by

$$d_G(x, y) \;=\; \lVert x - y \rVert_G^2 \;=\; (x - y)^T G (x - y) \tag{4}$$

where G ∈ R^{n×n} is the distance metric. Alternatively, the matrix G can be decomposed as G = L^T L, in which case (4) becomes

$$d_G(x, y) \;=\; (L(x - y))^T L(x - y). \tag{5}$$

The new distance is given by the Euclidean distance of the data projected into the L subspace. The matrix G is required to be positive semidefinite to guarantee that the new distance will satisfy the requirements for a metric, i.e., non-negativity, symmetry, and the triangle inequality.
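The equivalence of (4) and (5) is easy to verify numerically; a minimal sketch with a randomly generated metric (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

L = rng.standard_normal((n, n))
G = L.T @ L                               # G = L^T L is positive semidefinite by construction

x, y = rng.standard_normal(n), rng.standard_normal(n)
diff = x - y

d_metric = diff @ G @ diff                # (x - y)^T G (x - y), Eq. (4)
d_proj = np.sum((L @ diff) ** 2)          # ||L(x - y)||^2, Eq. (5)

assert np.isclose(d_metric, d_proj)       # the two forms agree
assert d_metric >= 0.0                    # PSD G guarantees non-negativity
```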

DML is closely related to subspace learning, as can be seen from (5). Subspace learning techniques have been extensively used in object tracking: for example, Eigentracking [6] applies PCA to the training data to identify a subspace that is subsequently utilized in tracking; linear discriminant analysis (LDA) [7] estimates a subspace that can both describe the data and provide discriminating information. However, Eigentracking is an unsupervised method, i.e., it does not take class information into consideration, while the supervised LDA assumes that each class has a similar within-class distribution, which may not be true in all cases.

DML is primarily concerned with identifying a transformation that is optimized for classification. In general, there are two non-exclusive approaches to DML: offline (batch) processing and online processing. In offline processing, the underlying assumption is that all training data points and their corresponding labels are available during training. In online approaches, the assumption is that a single example is presented at each step, and the distance metric is modified so that it remains consistent with the previous examples while also satisfying the newly presented example.

One of the earliest approaches that explicitly discussed learning a distance function for classification in an offline fashion was proposed in [5], where distance metric learning was formulated as a constrained convex optimization problem. More recent approaches fall into the category of local DML and attempt to learn a distance metric that will satisfy the constraints in a local region around each data point instead of all pairwise constraints. Local DML approaches are easier to handle and are directly related to the local nature of the nearest neighbor classifier. Examples of local DML include neighborhood component analysis [8], multiple collapsing classes [9], and large margin nearest neighbors [4].

In this paper, we utilize a recently proposed method for local DML called information theoretic metric learning (ITML) [10]. In ITML, given an initial estimate of the Mahalanobis distance matrix, G₀, the objective is to identify a new Mahalanobis matrix, G, that will meet two constraints: it will be similar (in a KL divergence sense) to the original distance matrix, and it will satisfy the class label constraints. Assume that each distance matrix G corresponds to the inverse of the covariance matrix of an unknown multivariate Gaussian distribution given by

$$p(x; G) \;=\; \frac{1}{Z} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T G (x - \mu)\right) \;=\; \frac{1}{Z} \exp\!\left(-\tfrac{1}{2}\, d_G(x, \mu)\right) \tag{6}$$

where μ is the mean and Z is a normalization constant. Then the Kullback–Leibler (KL) divergence can be employed as a robust measure of the correspondence between the two Gaussian distributions parameterized by G and G₀, and is given by

$$\mathrm{KL}\big(p(x; G_0) \,\|\, p(x; G)\big) \;=\; \int p(x; G_0) \log \frac{p(x; G_0)}{p(x; G)}\, dx. \tag{7}$$

The second objective of ITML is to satisfy the class constraints, i.e., bring elements from the same class closer and move elements from different classes far apart. For example, let x_1, x_2, and x_3 be three vectors that belong to two classes C_1 and C_2 such that {x_1, x_2} ∈ C_1 and {x_3} ∈ C_2. The objective of a DML algorithm, and ITML in this case, is to identify a new distance matrix G such that the new distance metric d_G preserves the relationships d_G(x_1, x_2) < d_G(x_1, x_3) and d_G(x_1, x_2) < d_G(x_2, x_3).

Formally, the objective of ITML is to find the distance matrix G so that

min_G KL(p(x; G_0) ‖ p(x; G))    (8)

subject to

d_G(x, y) ≤ l, if label(x) = label(y)
d_G(x, y) ≥ u, if label(x) ≠ label(y).    (9)

The solution of the problem is found by a LogDet optimization. In [10], the authors introduced appropriate slack variables into the formulation for the case when the exact solution is not feasible due to unsatisfied constraints. A major benefit of the ITML method is that it does not require expensive eigendecomposition and is thus more computationally efficient.
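For reference, the squared Mahalanobis distance d_G underlying both ITML and the tracking distance can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def mahalanobis_sq(x, y, G):
    """Squared Mahalanobis distance d_G(x, y) = (x - y)^T G (x - y)."""
    z = x - y
    return float(z @ G @ z)

# With G equal to the identity, d_G reduces to the squared Euclidean
# distance; learning G reshapes this geometry for classification.
x, y = np.array([1.0, 2.0]), np.array([3.0, 1.0])
assert np.isclose(mahalanobis_sq(x, y, np.eye(2)), 5.0)
```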

In addition to offline algorithms, various online DML approaches have been presented. One of the earliest was the POLA algorithm [31], which optimizes a large-margin objective. However, the algorithm requires an eigenvector computation at each iteration, which can be slow in practice. Another approach is the recently proposed LogDet exact gradient online (LEGO) algorithm [11]. Given two vectors u_t and v_t with predicted distance d_{G_t}(u_t, v_t) = ŷ_t and target distance y_t, the LEGO algorithm updates the distance by minimizing

G_{t+1} = arg min_{G ≻ 0} D(G, G_t) + η ℓ(d_G(u_t, v_t), y_t)    (10)

where D(G, G_t) is a regularization function, η is a regularization parameter, and ℓ(ŷ_t, y_t) is the loss between the target distance y_t and the predicted distance ŷ_t. The solution to the minimization problem is given by

G_{t+1} = G_t − η(ȳ − y_t) G_t z_t z_t^T G_t / (1 + η(ȳ − y_t) z_t^T G_t z_t)    (11)

where z_t = u_t − v_t and ȳ = d_{G_{t+1}}(u_t, v_t). The distance ȳ is found in closed form as

ȳ = (η y_t ŷ_t − 1 + √((η y_t ŷ_t − 1)² + 4η ŷ_t²)) / (2η ŷ_t).    (12)
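A minimal numerical sketch of one LEGO step, implementing the rank-one update (11) together with the closed-form scalar (12) (the function name, toy data, and default η are ours; ŷ_t denotes the predicted distance):

```python
import numpy as np

def lego_update(G, u, v, y_target, eta=0.6):
    """One LEGO step: exact-gradient LogDet update of the metric G.

    z = u - v, y_hat = z^T G z is the predicted distance, y_bar solves
    the scalar subproblem in (12), and G receives the rank-one
    correction of (11).
    """
    z = u - v
    Gz = G @ z
    y_hat = float(z @ Gz)
    # Closed-form solution of the scalar problem in (12)
    a = eta * y_target * y_hat - 1.0
    y_bar = (a + np.sqrt(a * a + 4.0 * eta * y_hat ** 2)) / (2.0 * eta * y_hat)
    # Rank-one metric update in (11)
    beta = eta * (y_bar - y_target)
    return G - (beta * np.outer(Gz, Gz)) / (1.0 + beta * y_hat)

# Sketch usage: pull the distance of a same-class pair toward a
# smaller target; the updated metric reports a reduced distance.
G = np.eye(3)
u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
G1 = lego_update(G, u, v, y_target=0.5)
z = u - v
assert z @ G1 @ z < z @ G @ z
```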

Fig. 1. Example demonstrating distance learning for object tracking.

V. Distance Metric Learning for Object Tracking

In this paper, we propose the use of batch and online DML for object tracking. To achieve this goal, we model the tracking problem as a binary classification problem. The first class includes visual examples of the object's appearance collected during tracking, while the second class includes visual examples of the background. The processing pipeline of DML for tracking is shown in Fig. 1, where the problem of tracking is translated to that of binary classification. The two classes are the object and the background, shown in red and green boxes in the figure.

The background class is populated by regions that belong to the background and are spatially close to the object's location, while the object class is populated by regions containing the object at various time instances. Each region is represented as a point in some high dimensional feature space. Ideally, we would like the points corresponding to the background to be well separated from the points corresponding to the object. However, this may not be the case. We deal with this issue by applying a distance learning algorithm that produces a better separation between the background and the object. Regions that belong to the object class, collected at various time instances, are grouped in the template library according to the template update strategy. Using the learned distance and the template library, the object is localized in a new frame by finding the region that achieves the smallest distance to the template library.

The proposed tracking algorithm, termed DMLTracking, utilizes both offline and online distance metric learning to learn the appearance of the object and differentiate it from the appearance of the background. During initialization, a batch DML approach, ITML, is employed to obtain a distance metric that is consistent with the initial appearance of the object and the background. In other words, during bootstrapping, the objective is to learn a distance metric that produces small distances (more similar) between the appearances of the object in the first few frames, while creating large distances (less similar) between the appearances of the object and the background. Once the initialization is complete, we utilize an online DML algorithm, LEGO, to incrementally update the distance so that it remains consistent with the evolving appearance of both the object and the background.

The use of the ITML and LEGO algorithms for distance learning offers three significant benefits with respect to the


Fig. 2. Region search pattern.

requirements of tracking. First, both algorithms introduce a term that enforces smoothness in the variability of the learned distance matrix. This is especially important for online learning, since we expect the variation in the object's appearance to be smooth and therefore the learned distance should not change significantly in successive frames. In addition, the ITML and LEGO algorithms are designed based on the large margin property. This property is well suited for tracking, especially in the application of ITML as bootstrapping, since we expect changes in the object's and background's appearance to take place. Introducing a large margin reduces the probability of misclassifying the background as part of the object or vice versa. Last, LEGO has low complexity, a vital trait for a real-time tracking algorithm.

A. Object Localization

Under the smooth motion assumption, the object's location in the new frame will be near its location in the previous frame, so that, in most cases, only a small region has to be searched. This is fortunate, as exhaustively searching the full image for a region that exhibits high similarity with the tracked object is computationally demanding and is not suitable for a resource constrained system.

In this paper, we employ an iterative search method using the learned distance to localize the object in a new frame. The search pattern is initialized at the location of the object in the previous frame. For each region in the search pattern, the distance between the template library and the representation of the region is estimated, and the region corresponding to the smallest distance is found. If this region is at the center, the best possible location has been found. Otherwise, the search pattern is re-centered at that region and the search continues. Fig. 2 displays the pattern of locations that are searched at each iteration. Each point corresponds to the center of a candidate search region with the same size as the target region. We selected this type of search pattern because a higher sampling rate near the center can lead to more accurate localization, while a lower sampling rate further away from the center can support faster object motion with lower complexity.

Given the object's appearance at time t − 1 and spatial location X, the object's representation (SIFT over a regular grid followed by random projections) is given by I(t − 1, X). The objective of the template matching mechanism is to estimate the new object location X given by

X = X + ΔX = arg min_{X+ΔX ∈ S} min_{i={1...p}} d_G(I(t, X + ΔX), M_i)    (13)

where M_i, for i = {1 . . . p}, is the template library, d_G is the learned distance, and S is the search pattern. We discuss the design of the template library in Section V-C.
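The iterative pattern search described above can be sketched as follows (an illustrative skeleton; the names are ours, and a toy distance surface stands in for the template-library distance of (13)):

```python
import numpy as np

def localize(prev_xy, distance_at, pattern, max_iters=50):
    """Iterative pattern search of Section V-A (illustrative sketch).

    distance_at(x, y) -> smallest learned distance between the window
    centered at (x, y) and the template library; `pattern` is a list of
    (dx, dy) offsets that includes (0, 0) as its center.
    """
    x, y = prev_xy
    for _ in range(max_iters):
        dists = [distance_at(x + dx, y + dy) for dx, dy in pattern]
        dx, dy = pattern[int(np.argmin(dists))]
        if (dx, dy) == (0, 0):      # center is best: converged
            break
        x, y = x + dx, y + dy       # re-center the pattern and repeat
    return x, y

# Toy example: a distance surface minimized at (7, -3); dense offsets
# near the center, sparser ones farther out, as in Fig. 2.
pattern = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
           (3, 0), (-3, 0), (0, 3), (0, -3)]
dist = lambda x, y: (x - 7) ** 2 + (y + 3) ** 2
assert localize((0, 0), dist, pattern) == (7, -3)
```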

B. Distance Metric Update

In this paper, we employ a learned distance to estimate the new location of the object in (13). We consider the scenario where the object's characteristics are not known a priori. As such, we cannot train a DML algorithm offline. However, relying solely on incrementally learning the appropriate distance can lead to drift. We tackle this problem by following a hybrid approach. We use (13) without any learned distance metric (i.e., matrix G corresponds to the identity matrix) for the first few frames (four in this case) to collect a baseline appearance of the object and the background. Once the baseline representations are collected, we run the ITML algorithm to learn the object-specific distance metric as well as the corresponding thresholds, as shown in (8) and (9). This type of bootstrap initialization offers two major benefits.

First, during the bootstrapping process, we assume that the appearance of the object and the appearance of the background remain relatively stationary for the first few frames, and therefore we can apply a batch DML algorithm to obtain a reliable estimate of the object's and background's appearance. Second, in addition to modeling the appearance of the object and the background, we also obtain the distances between all examples from the same class and the distances between all examples from different classes. We denote by l in (9) the maximum allowable distance between elements from the same class and by u in (9) the minimum allowable distance between elements from different classes. In our experiments, we selected the threshold l to be 95% of the maximum distance between examples from the same class and u to be 5% of the minimum distance between examples from different classes. These values are used as thresholds and can subsequently be used to identify occlusions and guide the distance metric update strategy. Alternatively, if an estimate of the object's appearance is available beforehand, as in the case of face tracking, we can obtain initial estimates of the learned distance and thresholds by training the DML algorithm offline using numerous visual examples.
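The threshold selection just described can be written down directly (a minimal sketch mirroring the percentages stated above; the helper name is ours):

```python
import numpy as np

def bootstrap_thresholds(same_class_dists, diff_class_dists):
    """Thresholds from bootstrap distances, per Section V-B:
    l is 95% of the largest same-class distance, u is 5% of the
    smallest cross-class distance (percentages as stated in the text)."""
    l = 0.95 * float(np.max(same_class_dists))
    u = 0.05 * float(np.min(diff_class_dists))
    return l, u

l, u = bootstrap_thresholds([0.2, 0.5, 1.0], [4.0, 6.0, 9.0])
assert np.isclose(l, 0.95) and np.isclose(u, 0.2)
```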

In both cases, offline and bootstrap, the learned distance should be updated to incorporate changes in the object's appearance. Following the previous notation, I(X; t) is the representation of the image region that contains the object, M = {M_i | i = 1 . . . m} is the template library, and J(X_j; t), where j = {1, . . . , c}, are representations of the image regions that are close to the previous location X but do not contain the object (background regions). The objective of online distance metric learning is to update the distance matrix G_t to the new distance matrix G_{t+1} so that

d_{G_{t+1}}(I(X; t), M) << d_{G_t}(I(X; t), M)    (14)

and

d_{G_{t+1}}(I(X; t), J(X_j; t)) >> d_{G_t}(I(X; t), J(X_j; t)).    (15)

In the context of tracking, we identified two types ofdistances that have to be updated. The first one is the distance


between the current object appearance and the elements of the template library. This distance should be as small as possible, i.e., d_{G_{t+1}}(I(X; t), M) ≈ ε. The second type is the distance between the current object appearance and the neighboring background regions J(X_j; t) that do not contain the object. This distance should be large, i.e., d_{G_{t+1}}(I(X; t), J(X_j; t)) = u + δ, where u is the threshold in (9) and δ is an arbitrary high valued constant. In this paper, the distance thresholds are learned during bootstrapping and remain static during tracking.

C. Template Library Update

To keep the template library up-to-date with the appearances of the object, it should be updated by replacing old templates with new ones. One of the benefits of using DML is that the learned distances to the template elements can be used to predict if and how the template library should be updated. When a new frame is presented, the template element most similar (smallest distance) to the new appearance is replaced with it. Formally, let

S_t = min_i d_G(I(X), M_i)    (16)

be the minimum distance between the object's appearance at the estimated location X and the template library elements M = {M_i | i = 1 . . . m}. We can identify occlusion by comparing this distance with the threshold u for the minimum allowable distance between the object and the background. If S_t > u, then we can infer that the new location contains something that is more similar to the background than to the object, which is indicative of occlusion. If S_t ≤ u, then the window probably contains the object and therefore can be used as an example for the template library.

To update the template library, we first have to decide if the new appearance is informative enough to be included in the library. A new appearance is considered a candidate for the template library if S_t > S_{t−1}, i.e., if the similarity between the appearance of the object in the current frame and the examples in the template library is less than the similarity achieved in the previous frame. If this assertion holds, then the template element that achieves the smallest distance (most similar) to the object's current appearance is replaced with the new appearance. The proposed update mechanism is motivated by the fact that a template element that is very similar to the object's current appearance carries little information, whereas a template element with a higher distance is more informative. By maintaining a small number of template examples, instead of a single previous one, we ensure that if the appearance of the object suddenly changes in a frame and then returns to the original one, at least one template element will be consistent with the original object appearance and therefore will not be evicted from the template library during the update.
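The occlusion test and the update strategy above can be sketched together (an illustrative skeleton; the names are ours, and `dist` stands for the learned distance d_G):

```python
import numpy as np

def update_templates(templates, appearance, dist, u, prev_min_dist):
    """Template-library update of Section V-C (illustrative sketch).

    dist(a, b) is the learned distance d_G; u is the minimum allowable
    object/background distance from bootstrapping.
    Returns (templates, S_t, occluded).
    """
    dists = [dist(appearance, m) for m in templates]
    s_t = min(dists)
    if s_t > u:                       # closer to background: occlusion,
        return templates, s_t, True   # so freeze all updates
    if s_t > prev_min_dist:           # new view is informative enough:
        # evict the most similar (least informative) template
        templates[int(np.argmin(dists))] = appearance
    return templates, s_t, False
```

For example, with a Euclidean `dist`, a new appearance close to one template but within the threshold u replaces that most-similar template, while a distance above u flags an occlusion and leaves the library untouched.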

D. Overview of the Proposed Tracking Algorithm

In this section, we provide an algorithmic description of the proposed tracking method. The algorithm consists of two parts: the bootstrapping process and the tracking process. Bootstrapping is the initialization of the object's model given its appearance in the first few frames and is described by the following steps:

Algorithm 1 Bootstrapping process

Input: Initial object location from the first n frames
1. Collect image windows from the object's location I+ = {I+_1, . . . , I+_m} (positive samples) and windows around the location I− = {I−_1, . . . , I−_n} (negative samples) from the first n frames using (13) with G = I, where I is the identity matrix.
2. Perform feature extraction on the collected samples to obtain I+ and I−.
3. Run batch distance metric learning (ITML) such that d_G(I+, I−) ≥ d_G(I+, I+) for all positive and negative examples.
Output: Distance matrix G, thresholds l and u

Algorithm 2 Tracking process

Input: Object location, distance metric G_t, thresholds l and u, template library M_t
1. Collect candidate windows using the search pattern from the previous location (t, X + ΔX)
   1.1. Perform feature extraction to get I(t, X + ΔX)
   1.2. Measure the distance to all elements of the template library M_t and return the one that achieves the smallest distance S_t
2. If the window that achieves the distance S_t does not correspond to the central region of the search pattern, update the location and return to step 1; otherwise
3. Set the new object location equal to the location of the central region and calculate the distance to the template library examples using (16)
4. If distance S_t ≤ u
   4.1. Collect negative examples I− (around the object's location) and a positive example I+ (the window containing the object)
   4.2. Update the distance metric matrix G_{t+1} using (10)
   4.3. If S_t > S_{t−1}, update the template library to M_{t+1} by removing the most similar element (smallest distance) according to the updated distance metric G_{t+1}
5. If distance S_t > u
   5.1. Set the object as lost/occluded and stop the updates
Output: Object location X, updated distance metric G_{t+1}, updated template library M_{t+1}

During the bootstrapping process, it is assumed that the object's appearance will remain relatively the same for the first few frames (four in our experiments), given its location in the first frame, and therefore applying (13) with the identity matrix will not result in misclassification. In our experimental results, this assumption did not cause any instability to the ITML. During runtime, for each new frame, object localization and model update are achieved according to the process of Algorithm 2.


Fig. 3. Coke Can sequence. A Coke can is shown in a cluttered background while it undergoes changes in appearance due to out-of-plane rotations and changes in illumination caused by the direct light source.

VI. Experimental Results

A series of experiments was conducted to evaluate the tracking accuracy of the proposed system. For all test sequences, the same set of parameters was used, i.e., the system parameters were not adjusted for each video sequence, which corresponds to a more realistic scenario and is more challenging. Regarding object localization, we utilized the search pattern shown in Fig. 2 in an iterative localization process. Each candidate image region was resized to 50 × 50 pixels and 9 SIFT descriptors were extracted. The 1152-dimensional representation of the window was reduced to 300 dimensions using the RP method. In order to obtain an initial model of the object and the background, the first four frames were collected and the object's location was extracted using an identity matrix as the distance matrix. In addition, for each frame, eight regions of background were collected around the estimated location of the object at eight spatially adjacent locations at the following orientations: 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°. For the bootstrapping process, ITML was applied with the gamma value set to 10 and the number of neighbors set to 5. To maintain a rich representation of the object's appearance, four templates were used from previous time instances. These templates were updated only if the newly obtained object representation was below the appropriate threshold. The same threshold was used to update the learned distances using the LEGO algorithm. The learning parameter η of the LEGO algorithm was set to 0.6 for learning the object's appearance and 0.1 for learning the background's appearance. Updating the template library and the learned distance on every frame, our MATLAB implementation currently operates at around 5 f/s on a dual core desktop computer.
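The dimensionality-reduction step just described (9 grid SIFT descriptors of 128 dimensions each, i.e., a 1152-dimensional vector, projected down to 300 dimensions) can be sketched with one common random-projection construction; the Gaussian matrix and scaling are our illustrative choice, not necessarily the paper's exact RP variant:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_project(x, d_out=300):
    """Project a feature vector to d_out dims with a Gaussian random
    matrix. Entries are scaled by 1/sqrt(d_out) so that pairwise
    distances are approximately preserved (Johnson-Lindenstrauss)."""
    d_in = x.shape[-1]
    R = rng.normal(0.0, 1.0 / np.sqrt(d_out), size=(d_in, d_out))
    return x @ R

sift_grid = rng.normal(size=9 * 128)   # stand-in for 9 SIFT descriptors
z = random_project(sift_grid)
assert z.shape == (300,)
```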

The objective of the experiments was to compare the performance of the proposed DMLTracking algorithm with state-of-the-art object tracking algorithms in challenging scenarios. The experiments were carried out using publicly available video sequences. The video sequences shown in Figs. 3–10 are called Coke Can, Coupon, Sylvester, Tiger 2, David, Girl, Face Occluded, and Face Occluded 2 and were downloaded from [32], while the Car chasing sequence in Fig. 11 was downloaded from Google videos.

Fig. 4. Coupon sequence. A booklet is tracked while it changes its appearance (folding) and a new object with similar appearance is introduced in the scene.

Fig. 5. Sylvester sequence. A toy animal is shown moving under significant illumination changes caused by a direct light source while significantly changing appearance due to out-of-plane rotation.

This set of video sequences represents challenging scenarios for an object tracking algorithm due to the significant variations in the object's and the background's appearance. The sequences in Figs. 6, 9, 10, and 11 investigate the trackers' ability to maintain accurate tracking when severe occlusions take place. The occlusions shown in these sequences can confuse online trackers so that the occluding object, instead of the actual object, is modeled and tracked, which causes drift. The sequences shown in Figs. 3, 5–8, 10, and 11 evaluate the ability of the tracker to maintain an accurate object model despite temporary and permanent changes in appearance. For example, the sequence in Fig. 7 shows a person walking while changing his pose, removing his glasses, and changing expression. Another type of obstacle that a tracker must overcome is that of similarly looking objects that are part of the background. Characteristic examples of this type of challenge are shown in Figs. 4 and 11.


Finally, dramatic changes in illumination, such as the ones shown in Figs. 5 and 11, can also create confusion for a tracker. We note that we explored generic object tracking, as opposed to specific objects such as human faces. Tracking a generic object is far more challenging than tracking a specific, predefined one, since the lack of a prior model makes the identification of occlusions and changes in appearance very challenging. The performance of the DMLTracking algorithm (shown using red solid lines) was compared with two state-of-the-art online tracking algorithms: the semi-supervised online boosting tracker (SemiBoost) [13] (shown using green dot-dashed lines) and the multiple instance learning tracker (MILTrack) [21] (shown using blue double-dashed lines).

The first sequence, called Coke Can (Fig. 3), examines the case of tracking a rigid object that undergoes changes in appearance and illumination in a cluttered scene. The changes in appearance are due to the rotation of the object, and the illumination changes are caused by moving under a direct light source. We observe that the DMLTracker is able to maintain accurate tracking, in contrast to the SemiBoost tracker, which loses the object in several frames. The MILTracker achieves similar performance.

The second sequence, called Coupon Book (Fig. 4), examines the case where the object undergoes a permanent change in appearance while a similarly looking object is also present in the scene. This example illustrates the ability of the tracker to update the appearance model of the object and separate the tracked object from another object with a similar initial appearance.

We observe that the DMLTracker correctly updates the appearance of the object and is not confused by the similarly looking object. Similar results are evident for the MILTracker. The SemiBoost tracker, on the other hand, builds an initial model from the first frame and is not able to adapt to the new appearance, which causes the tracker to lock onto the similar object instead of the tracked object.

The Sylvester sequence (Fig. 5) examines the ability of the tracker to maintain accurate tracking in long sequences (1300 frames) while going through changes in appearance due to pose and illumination. The object is a toy animal that is initially presented under a direct light source.

We observe that throughout the tracking, the object changes appearance due to rotation while moving in and out of the direct light source, which causes significant changes in appearance. Furthermore, the object is moving in a cluttered scene, which makes the tracking even more challenging. We observe that the DMLTracker correctly tracks the object despite the challenging conditions and achieves similar performance to the MILTracker. The SemiBoost tracker also maintains tracking; however, it provides lower localization accuracy.

The Tiger 2 sequence (Fig. 6) presents the case where the object suffers severe occlusions while changing appearance. More specifically, a toy animal is shown moving behind a plant. The cluttered foreground, in addition to the changes in appearance of the object, creates very challenging conditions. We observe that the SemiBoost tracker loses the object in many frames and locks onto a region similar to the region

Fig. 6. Tiger 2 sequence. A toy animal is tracked while it is partiallyoccluded by a plant and changing appearance.

that causes the occlusion. The DMLTracker achieves higher accuracy, but it eventually suffers from drift at the end of the sequence. The drift is caused by the inability of the DMLTracker to accurately model the object under all possible occlusion conditions. In this scenario, the MILTracker achieves the best results.

The next sequence, called David Indoor (Fig. 7), studies the case of realistic face tracking. The face undergoes several changes in appearance due to illumination, changes in pose, and permanent appearance changes (glasses). More specifically, the person enters the scene from a dark room, which immediately creates a challenge with respect to illumination robustness. Subsequently, the face undergoes changes in size, viewpoint, and appearance (glasses).

In this sequence, the DMLTracker achieves the best results, with much higher localization accuracy compared to both the MILTracker and the SemiBoost tracker. The increased performance is a consequence of the double model update mechanism, via the updates in the distance metric and the updates in the template library. The SemiBoost tracker often fails to update the appearance and drifts, while the MILTracker is more stable but less accurate than the DMLTracker.

A similar scenario is examined in the Girl sequence (Fig. 8), with more challenging changes in the appearance of the face due to out-of-plane rotation, changes in scale, and the presence of another similarly looking object (the man's face). The DMLTracker is able to update the object's appearance model without suffering drift. The MILTracker and the SemiBoost tracker maintain tracking; however, they exhibit lower accuracy, and in one instance (frame 423) the SemiBoost tracker is confused by the other face.

The previous sequences examined the behavior of the proposed tracking algorithm, as well as state-of-the-art tracking algorithms, in scenarios where the object undergoes severe changes in appearance, usually due to out-of-plane rotations and illumination changes. In the next two sequences, we examine the behavior of these algorithms when significant occlusions take place. The Face Occluded (Fig. 9) and Face Occluded 2 (Fig. 10) sequences present situations of face


Fig. 7. David Indoor sequence. A face tracking sequence is presented where the face is shown at different scales, from different viewpoints, and under different appearances, e.g., expressions and glasses.

Fig. 8. Girl sequence. A face is tracked while changing appearance and being occluded by another face.

tracking under partial and complete occlusions. The Face Occluded sequence is less challenging than Face Occluded 2, since the appearance of the face does not change during the sequence.

We observe that all three algorithms correctly maintain the tracking without drifting. We note, however, that the difference in localization accuracy presented in Table I is due to the unclear nature of the ground truth in this sequence. In other words, the tracking window may move to cover most of the object that is visible, or it may wait until the object is visible again. The DMLTracker and the MILTracker follow the former approach, while SemiBoost follows the latter, which causes the apparent inconsistency in the localization results.

In addition to overcoming the drift problem, the DMLTracker is able to report significant occlusions. This is achieved because the current view of the object starts to become more similar to the background, and thus its distance from the rest of the template library examples starts to exceed the same-class threshold.

Fig. 9. Face Occluded sequence. An occluded face is tracked.

Fig. 10. Face Occluded 2 sequence. A face is tracked while undergoing occlusions with significant variations in appearance.

The problems caused by occlusion are more evident in the second sequence, Face Occluded 2 (Fig. 10), where the object undergoes changes in appearance in addition to occlusions.

In this scenario, the optimal behavior for a tracker is to update the model when the face changes appearance (moving the head, adding the hat) while being able to identify occlusions and stop updating the model before they cause drift. In this scenario, the DMLTracker achieves superior accuracy compared to both the MILTracker and the SemiBoost tracker.

The last case we explore involves tracking an object from a moving camera under challenging illumination changes and occlusion. This scenario is presented in Fig. 11, where a car is followed by a helicopter-mounted camera while moving on a highway.

As we can see, the car undergoes severe illumination conditions, as shown in the second image of the top row, as well as occlusions. We observe that the proposed tracking scheme (shown in red) is able to handle the demanding


Fig. 11. Car chasing sequence. A car is tracked while moving at high speed close to other similarly looking cars and suffering occlusions by the traffic signs.

requirements and maintain accurate tracking. On the other hand, both the MILTracker and the SemiBoost tracker fail to maintain tracking. The MILTracker fails to follow the car when it goes under the traffic signs and completely loses the object afterwards. SemiBoost is more robust, since it handles the first complete occlusion, but suffers from drift that eventually causes complete failure during the second occlusion.

To supplement the results shown in Figs. 3–11, we provide Table I with the localization accuracy of four state-of-the-art object tracking algorithms as well as the proposed one. These algorithms include the online AdaBoost tracker [22], the SemiBoost tracker [13], the fragments-based tracker (FragTrack) [34], and the MILTracker [21]. The results in the table correspond to the mean distance in pixels between the center of the tracking window and the center of the ground truth window after five independent trials. The best performance is indicated by a smaller number, corresponding to a closer match between the predicted center and the ground truth center. The results marked in red indicate the best performance and the ones in green correspond to the second best. We observe that the proposed algorithm, DMLTracking, achieves the best performance in six out of nine sequences and the second best in two others. The accuracy in the Coke Can sequence is comparable to the SemiBoost tracker and significantly better than the MILTracker. In the Coupon, David, and Occluded Face 2 sequences, the DMLTracking algorithm outperforms all other algorithms by a large margin, while in the Car chasing sequence, the DMLTracking algorithm is the only one that correctly maintains tracking through the sequence and achieves the best localization results. The only case where the proposed algorithm does not achieve top performance is Occluded Face. The lower performance is most likely due to the fact that the proposed DMLTracking algorithm, as well as the MILTracker, tries to select the largest portion of the visible object, while methods such as SemiBoost and FragTrack try to estimate where the object might be behind the occlusion and select that region.

TABLE I

Localization Accuracy on Generic Object Tracking

VII. Conclusion

In this paper, we proposed the use of an online discriminative learning mechanism for robust tracking of objects without any prior model of their appearance. The proposed scheme employs distance metric learning to reliably represent the similarity between different appearances of the object as well as the difference in appearance between the object and the background. The object's location in a new frame is found by selecting the region that minimizes the distance relative to a library of templates. Both the distance metric and the template library are updated online in order to adapt to changes in the object's appearance as well as to changes in illumination and pose. We employed a bootstrapping process for the initial estimation of the object's appearance. The representation of each image window is based on a combination of SIFT features extracted over a regular grid and random projections for dimensionality reduction. Experimental results suggest that the proposed algorithm is robust to changes in pose, illumination, and occlusions, and achieves performance comparable to state-of-the-art tracking algorithms.

Acknowledgment

The authors would like to thank the anonymous reviewers for their comments.

References

[1] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.

[2] A. Elgammal, R. Duraiswami, and L. Davis, "Probabilistic tracking in joint feature-spatial spaces," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., vol. 1, Jun. 2003, pp. 781–788.

[3] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surveys, vol. 38, no. 4, pp. 1–45, 2006.


[4] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learning Res., vol. 10, pp. 207–244, Dec. 2009.

[5] E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2003.

[6] M. J. Black and A. D. Jepson, "EigenTracking: Robust matching and tracking of articulated objects using a view-based representation," Int. J. Comput. Vision, vol. 26, no. 1, pp. 63–84, 1998.

[7] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1631–1643, Oct. 2005.

[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighborhood component analysis," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2004.

[9] A. Globerson and S. Roweis, "Metric learning by collapsing classes," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2006.

[10] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. Int. Conf. Mach. Learning, 2007, pp. 209–216.

[11] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, "Online metric learning and fast similarity search," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2008.

[12] H. Liu and F. Sun, "Semi-supervised ensemble tracking," in Proc. Int. Conf. Acoust. Speech Signal Process., 2009, p. 1648.

[13] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 234–247.

[14] M. Isard and A. Blake, "CONDENSATION: Conditional density propagation for visual tracking," Int. J. Comput. Vision, vol. 29, no. 1, pp. 5–28, 1998.

[15] A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi, "Robust online appearance models for visual tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1296–1311, Oct. 2003.

[16] L. Matthews, T. Ishikawa, and S. Baker, "The template update problem," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 6, pp. 810–815, Jun. 2004.

[17] J. Lim, D. Ross, R. S. Lin, and M. H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vision, vol. 77, nos. 1–3, pp. 125–141, 2008.

[18] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, "Face tracking and recognition with visual constraints in real-world videos," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.

[19] S. Avidan, "Support vector tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1064–1072, Aug. 2004.

[20] S. Avidan, "Ensemble tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 261–271, Feb. 2007.

[21] B. Babenko, M. H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2009, pp. 983–990.

[22] H. Grabner, M. Grabner, and H. Bischof, "Real-time tracking via on-line boosting," in Proc. BMVC, vol. 1, 2006, pp. 47–56.

[23] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Comput. Vision, vol. 2, 1999, pp. 1150–1157.

[24] N. Gheissari, T. Sebastian, P. Tu, J. Rittscher, and R. Hartley, "Person reidentification using spatiotemporal appearance," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2006, pp. 1528–1535.

[25] J. Li, S. K. Zhou, and R. Chellappa, "Appearance context modeling under geometric context," in Proc. Int. Conf. Comput. Vis., 2005.

[26] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1794–1801.

[27] G. Tsagkatakis and A. Savakis, "A random projections model for object tracking under variable pose and multi-camera views," in Proc. Int. Conf. Distributed Smart Cameras, 2009, pp. 1–7.

[28] D. Achlioptas, "Database-friendly random projections: Johnson–Lindenstrauss with binary coins," J. Comput. Syst. Sci., vol. 66, no. 4, pp. 671–687, 2003.

[29] H. Zhou, Y. Yuan, and C. Shi, "Object tracking using SIFT features and mean shift," Comput. Vis. Image Understanding, vol. 113, no. 3, pp. 345–352, 2009.

[30] A. Vedaldi and B. Fulkerson. VLFeat: Feature Extraction Library [Online]. Available: http://www.vlfeat.org

[31] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, "Online and batch learning of pseudo-metrics," in Proc. Int. Conf. Mach. Learning, vol. 94, 2004.

[32] [Online]. Available: http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml

[33] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., vol. 2, 2006, pp. 2169–2178.

[34] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2006, pp. 798–805.

[35] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," in Proc. Conf. Mod. Anal. Probab., 1984, pp. 189–206.

Grigorios Tsagkatakis (S'08) received the B.S. and M.S. degrees in electronics and computer engineering from the Technical University of Crete, Crete, Greece, in 2005 and 2007, respectively. He is currently working toward the Ph.D. degree in imaging science at the Center for Imaging Science, Rochester Institute of Technology (RIT), Rochester, NY.

He was a Teaching and Research Assistant with the Department of Electronics and Computer Engineering and worked on various European-funded projects from 2003 to 2007. He is currently a Teaching Assistant with the Department of Computer Engineering, RIT, and a Research Assistant with the Real Time Computer Vision Laboratory, RIT, working on human-computer interaction and computer vision for smartphones. His current research interests include computer vision and machine learning.

Dr. Tsagkatakis received the Best Paper Award from the Western New York Image Processing Workshop in 2010 for his paper "A framework for object class recognition with no visual examples."

Andreas Savakis (M'91–SM'97) received the B.S. (summa cum laude) and M.S. degrees from Old Dominion University, Norfolk, VA, and the Ph.D. degree from North Carolina State University, Raleigh, all in electrical engineering.

He is currently a Professor, the Head of the Department of Computer Engineering, and a Ph.D. Faculty Member in imaging science and computing and information sciences with the Rochester Institute of Technology (RIT), Rochester, NY. Before joining RIT, he was with Eastman Kodak Company, Rochester. His research has generated 11 patents and over 80 publications in journals, conferences, and book chapters. His current research interests include computer vision, image processing and medical imaging algorithms, and their implementation on mobile systems, multi-camera environments, and high performance platforms.

Dr. Savakis serves as an ABET evaluator for electrical engineering and computer engineering programs. His activities were recognized by the IEEE Third Millennium Medal from the IEEE Rochester Section in 2000 and by the NYSTAR Technology Transfer Award for Economic Impact in 2006.