


Image and Vision Computing 30 (2012) 306–316


An on-line learning method for face association in personal photo collection

Liliana Lo Presti, Marco La Cascia
University of Palermo, Italy

This paper has been recommended for acceptance by Matti Pietikainen.
Corresponding author. Tel.: +39 091 238 42 524.

E-mail address: [email protected] (L. Lo Presti).

0262-8856/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.imavis.2012.02.011

Article info

Article history: Received 10 May 2011; Received in revised form 6 January 2012; Accepted 26 February 2012

Keywords: Face descriptor; Data association; On-line learning; Semi-supervised learning; Digital libraries

Abstract

Due to the widespread use of cameras, it is very common to collect thousands of personal photos. A proper organization is needed to make the collection usable and to enable easy photo retrieval. In this paper, we present a method to organize personal photo collections based on "who" is in the picture. Our method consists of detecting the faces in the photo sequence and arranging them in groups corresponding to the probable identities. This problem can be conveniently modeled as a multi-target visual tracking problem where a set of on-line trained classifiers is used to represent the identity models. In contrast to other works where clustering methods are used, our method relies on a probabilistic framework; it does not require any prior information about the number of different identities in the photo album. To enable future comparison, we present experimental results on a public dataset and on a photo collection generated from a public face dataset.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Personal photo albums show characteristics that make them quite different from a generic image database. In general, such photo collections are characterized by the presence of few persons, and photos are taken in several places. Based on common sense, users would like to browse their own collections considering properties such as when and where a photo was taken and who is in the photo [1]. While the first two properties are based on contextual information and can be extracted – if available – from the EXIF data embedded within each photo [2], organizing the sequence based on who is in the photo is still a very challenging problem that requires each face to be associated with a tag. In this paper, we present a method aimed at minimizing the user effort in tagging photos. Our method detects the faces in the photo sequence to process, and arranges them in groups corresponding to identities. Users can tag faces by associating a label with a whole face group. A similar approach is also followed in [3–6]. However, these methods apply clustering techniques to get a coarse face partition that is later refined by applying post-processing steps.

In the case of personal photo album organization, clustering-based methods do not consider the mutual exclusivity constraint: "two faces in the same photo cannot be assigned to the same partition/identity". Our method, instead, has been developed to exploit this important property and adopts a semi-supervised approach for grouping similar faces. The detected faces are processed in bunches (each bunch corresponding to the set of faces detected in the same photo) and each face is associated to an identity. Identities are then iteratively estimated by means of the discovered associations. From this point of view, our method reduces to a multi-target visual tracking problem where the detected faces are used to estimate and update on-line an appearance model for each identity. The main difference between standard tracking applications and our domain is the lack of temporal smoothness for the observations.

Motivated by the success of recent works such as [7–10], where visual tracking is performed by modeling the appearance information via on-line trained classifiers, in this paper we adopt a similar approach to capture the face information available across the sequence while new associations between the identities and the depictions are discovered in each processed photo. Our main contribution is the use of a classifier ensemble to represent the set of identities, and a new probabilistic framework to enforce the mutual exclusivity constraint; such a framework ensures the collaboration among the classifiers and makes it possible to label the faces to be used for retraining the whole ensemble (see Fig. 1).

In our method, no prior information about the number of identities in the photo album is needed, in contrast with clustering methods (for instance K-means) where this information is sometimes required. If prior information about the collection is available, such as already tagged photos, it would be possible to train the classifiers offline and use them to classify faces detected in the new incoming photo sequence. However, an on-line strategy better accounts for the discovery of new identities when matches with the already found ones become too uncertain. In fact, in photo collections it is very common that new identities, never seen in previous pictures, appear in the collection.

In the following, we will present some of the most relevant related works, then we will give an overview of our method and explain its main components in depth. Finally, we will present


Fig. 1. Schema representing our framework: faces detected in the current photo are classified by the classifiers associated to each already discovered identity. The probabilities corresponding to the classification outcomes are used in a probabilistic framework to fuse the decisions from the classifier ensemble. Once the set of associations with maximal joint probability is found, the faces are used to update the classifiers' parameters and the person-clusters.


experimental results on a public dataset and on a photo sequence generated from a public face dataset.

2. Related works

Personal photo albums are photo collections depicting individuals belonging mainly to the same family or the same social group. In general, the number of depicted individuals/identities in the whole collection is small but unknown.

The most natural way to browse a photo collection is, perhaps, by considering who is in the photo based on tags attached to each face. Tags can be assigned by the user one face at a time, but such a task is very tedious and a user would soon give up when too many photos have to be processed. New methods enabling a user to easily and rapidly tag photos were presented in [11,12]. In such methods, the already tagged photos are used to suggest likely tags for identifying faces in untagged photos. Therefore, face recognition/matching techniques may be applied to perform tag association, considering that such a task must be performed in the "wild", that is, the detected faces are affected by large pose variations and abrupt illumination changes.

Recently, applications such as iPhoto [1] and Picasa [13] have provided tools for face detection and recognition that are used to suggest likely tags for each face detected within the collection. In particular, Picasa presents a tool closely related to our work where faces are organized in groups, each one containing faces of the same person. The user is assisted in the tagging task by asking for a confirmation of every suggested label before propagating it to all the faces within the same group. However, it can be observed that results can be affected by over-clustering, that is, several clusters can correspond to the same person (see [14]).

Despite the success of these commercial semi-automatic applications, the problem of photo organization remains very challenging. On the one hand, processing faces in the "wild" is not a solved problem, and new face descriptors and/or learning algorithms are needed to enhance identity recognition; on the other hand, new methods aiming to minimize the user's interactions are required, moving from semi-automatic to fully automatic photo organization.

Many previously proposed methods [3–5] use clustering to group faces, each cluster representing an identity. All these methods do not explicitly consider the mutual exclusivity constraint. Only some works, as for example [4,15], use this property as a post-processing step in order to remove incorrect associations or to suggest a probable tag for each face. In [4], the K-means algorithm is applied in several subsequent steps for identifying persons using both face and clothing information. Faces are rearranged to enforce the mutual exclusivity constraint during a refinement step. This approach has the limitation that the number of face clusters has to be specified by the user.

In [16], active learning has been extended with the mutual exclusivity constraint to determine which faces must be tagged first in order to limit the number of inputs from the user. They propose a probabilistic discriminative model to induce a probability distribution over class labels, and a Markov Random Field to enforce the constraints. However, a considerable user effort is still required to tag photos. In [17], the mutual exclusivity constraint is used within an agglomerative hierarchical clustering to fuse clusters whose faces do not belong to the same photo. However, the constraint is not used for evaluating the best association and, in general, the result depends on the order of cluster processing.

An interesting approach is presented in [6], where users are allowed to multi-select a group of photos and assign a name/tag to the person appearing in them. The method attempts to propagate the assigned name from photo level to face level, i.e., to infer the correspondence between name and face. However, while the user effort for tagging is minimized, the user still has to manually identify the group of photos where that person appears. Moreover, the method is not able to disambiguate between persons that always appear together in the set of photos.

In previous works [18,14], associations across a photo sequence are found by considering face and clothing features locally within a temporal window and by means of a joint probabilistic data association (JPDA) [19]. Basically, associations between identities and depictions in a photo are computed as a maximum matching in a bipartite graph [20] by means of the Hungarian algorithm [21]. While the method is able to cluster depictions in an unsupervised way, it depends on the probability that a new identity is discovered and/or a person is not depicted in a photo. Such probabilities are difficult to estimate in practical cases and in general must be manually set by the user. In such methods, the face model is not adapted on-line, to avoid drifting of the model due to pose changes, and the probability of a match is computed based on the distance between depiction features and identity appearance models.

In this paper, instead, we consider only the face information and train an on-line classifier to represent each identity. In contrast to [18], all the associations among identities and depictions are used when updating the identity model, so that the model improves not only due to the positive associations but also by means of the negative ones. The probability of having a match is computed based on the margin of a face to the decision boundary. Moreover, the new probabilistic framework for labeling the data takes into account only the classifier decisions about each face, and it does not depend on the probability of discovering a new identity or the probability that an identity will reappear in the photo sequence.

Our work is related to semi-supervised learning approaches. Traditional classifiers use only labeled data during the training step. Semi-supervised learning uses large amounts of unlabeled data, together with the labeled data, to build better classifiers [22]. In our framework, we apply a self-training strategy and, therefore, do not collect any label from the user but iteratively build the training set for each classifier. In practice, we use the mutual exclusivity constraint to infer labels to be used for co-training the classifiers. In the classical co-training approach [23,24], features are split into two sets that are conditionally independent given the class label. Two separate classifiers are trained with the labeled data; each classifier then classifies the unlabeled data, and teaches the other classifier with the unlabeled examples. In this paper, we consider a multi-class problem. A classifier is trained for each identity and the pool of classifiers collaborates to predict the labels for never-seen samples. The set of new labels is used to update the training set and the parameters of each classifier.

In practice, our collaborative training uses a probabilistic framework to fuse the decisions coming from the whole set of classifiers; the inferred set of associations is used to label the faces in the currently processed photo and to retrain the classifier ensemble.

One of the main benefits of our approach is that the number of identities is inferred automatically. Many attempts have been made before to determine the best number of clusters to use [17,3]. This problem arises also in many other research fields, and methods based on Visual Assessment of Cluster Tendency (VAT) [25,26] are sometimes adopted; however, their results are difficult to interpret when the clusters are overlapping or are composed of few samples, i.e., the collection is unbalanced. A possible solution consists of considering several values for the number of clusters and then selecting the model that performs best [27]. However, trying all the possible partitions of faces into clusters has a prohibitive complexity and poses many interesting challenges in the case of unbalanced collections and noisy features.

3. Motivations

Clustering techniques generally require that the number of partitions or some similarity threshold be specified by the user.

K-means is one of the clustering techniques commonly used for photo album organization. It is an example of a competitive learning algorithm that estimates the centroid of each cluster by minimizing the within-cluster dissimilarity and computes, as a result, a Voronoi tessellation of the feature space [28]. K-means is prone to finding local minima [29], and the number of clusters to compute must be known a priori.

Another very popular clustering method is the agglomerative hierarchical clustering algorithm. It tries to cluster data by a bottom-up strategy that iteratively merges clusters that look similar based on the distances among their members. However, the nearest point to a sample could be a member of a different cluster because of noise or outliers. This makes the method very sensitive to perturbations in the data and prone to creating clusters of small size [29]. Performance of the method is strongly related to the cut-off threshold used to determine the partition and the distance function used to compare different samples. Some challenges are related to space dimensionality. There are results [29,30] stating that the evaluation of the distance becomes less and less meaningful with growing dimensionality, when the distances of a point to its farthest neighbor and to its nearest neighbor tend to become equal.
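The distance-concentration effect just described can be checked numerically. The sketch below (point count and dimensionalities are arbitrary choices, not values from the paper) measures, for uniformly random points, the mean ratio of nearest to farthest neighbor distance; the ratio approaches 1 as the dimensionality grows.

```python
import numpy as np

def concentration_ratio(n_points, dim, rng):
    """Mean ratio of nearest to farthest neighbor distance for random points."""
    X = rng.random((n_points, dim))
    # pairwise squared Euclidean distances via the Gram matrix (memory-friendly)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d = np.sqrt(np.clip(d2, 0.0, None))
    np.fill_diagonal(d, np.nan)  # exclude self-distances
    nearest = np.nanmin(d, axis=1)
    farthest = np.nanmax(d, axis=1)
    return float(np.mean(nearest / farthest))

rng = np.random.default_rng(0)
low = concentration_ratio(100, 2, rng)      # low-dimensional space: ratio far below 1
high = concentration_ratio(100, 5000, rng)  # high-dimensional space: ratio close to 1
```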

Similar issues are common to other clustering techniques as well; for example, mean-shift [31] requires a proper bandwidth selection for the kernel function, while quality threshold clustering [32] requires setting the maximum distance between elements in the same cluster. Therefore, clustering methods seem poorly suited to the problem we aim to solve, where classes are not necessarily linearly separable, there are outliers, and descriptors may have high dimensionality. In this paper, we propose an alternative to clustering techniques. On-line learning is used and data are processed in bunches, each one comprising the faces detected in a single photo. We use these faces to train on-line a set of classifiers and estimate an appearance model for each identity. For each classifier, the hyperplane is estimated to guarantee a large margin of the positive and negative instances to the decision boundary. In our framework, classifiers collaborate to infer labels for the new samples. When all the classifiers have a low confidence in classifying a sample, a new identity is added. In this way, the number of identities is not set a priori, which is a very attractive property for the domain of personal photo collection organization.
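The bunch-wise loop described above can be sketched as follows. The classifier interface, the confidence threshold and the greedy face-by-face ordering are illustrative assumptions (the full method instead selects the jointly most probable association set for the whole photo); the sketch only shows how new identities are opened when every classifier is unconfident.

```python
# Sketch of the bunch-wise on-line loop (names are illustrative, not the
# authors' code). Each identity holds one classifier; when no classifier is
# confident about a face, a new identity is created.
CONF_THRESHOLD = 0.5  # assumed minimum posterior to accept a match

def process_photo(faces, identities, make_classifier):
    """faces: descriptors detected in one photo; identities: list of classifiers."""
    assigned = set()  # mutual exclusivity: at most one face per identity per photo
    labels = []
    for face in faces:
        # score the face against every identity not yet used in this photo
        scores = [(i, clf.posterior(face)) for i, clf in enumerate(identities)
                  if i not in assigned]
        best = max(scores, key=lambda s: s[1], default=(None, 0.0))
        if best[1] < CONF_THRESHOLD:
            # low confidence everywhere: a new identity is discovered
            identities.append(make_classifier())
            labels.append(len(identities) - 1)
            assigned.add(len(identities) - 1)
        else:
            labels.append(best[0])
            assigned.add(best[0])
    # retrain every classifier: associated faces are positives, others negatives
    for i, clf in enumerate(identities):
        for face, lab in zip(faces, labels):
            clf.update(face, +1 if lab == i else -1)
    return labels
```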

4. Face association in photo sequences

In our framework, photos are analyzed to detect and align faces, and a descriptor is computed for each face. Photos are sorted based on the number of faces detected, so that photos with a higher number of detected faces are analyzed first. We maintain an on-line trained classifier for each identity. When a set of faces detected in the same photo needs to be processed, it is passed to the classifiers to compute the posterior probability of a match. Such probabilities are used to formulate a data association problem that enforces the mutual exclusivity constraint. Once the best set of associations among faces and identities has been found, it is used to update the classifiers trained in an on-line fashion. If a face has not been assigned to any identity, then a new identity has been discovered and the other faces in the current photo are used as negative samples to initialize a new on-line trained classifier (see Fig. 1). In the following, we provide more details about each of these processing steps.
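The mutual exclusivity constraint turns labeling into an assignment problem. A small illustrative sketch: for the few faces in one photo, the jointly most probable assignment can even be enumerated exhaustively (the Hungarian algorithm [21] solves the same problem efficiently). The `new_id_prob` score for opening a new identity is an assumed placeholder, not a value from the paper.

```python
import numpy as np
from itertools import permutations

def associate(prob, new_id_prob=0.2):
    """Find the assignment of faces (rows) to distinct identities (columns)
    maximizing the joint probability under mutual exclusivity.
    prob[f, i] is the posterior that face f matches identity i (illustrative);
    new_id_prob is an assumed fixed score for discovering a new identity."""
    n_faces, n_ids = prob.shape
    # augment with one "new identity" slot per face so any face can opt out
    aug = np.hstack([prob, np.full((n_faces, n_faces), new_id_prob)])
    best, best_p = None, -1.0
    for cols in permutations(range(n_ids + n_faces), n_faces):
        p = float(np.prod(aug[np.arange(n_faces), cols]))
        if p > best_p:
            best, best_p = cols, p
    # columns >= n_ids denote newly discovered identities
    return [c if c < n_ids else None for c in best]
```

Note that the joint optimum can differ from a greedy face-by-face choice: below, face 0 gives up its best identity so that the product of probabilities is maximized.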

4.1. Face detection and processing

In practical cases, face recognition is a very challenging task. Illumination changes, face poses and detection quality affect the performance of any face recognition algorithm. In some cases, false positives or misdetections can arise, and very often occlusions with objects, hands or surrounding persons increase the noise and the uncertainty. Moreover, a person's appearance can change over time: different hair styles, glasses or a beard make it more uncertain to establish correspondences among faces. To get a proper face image, it is necessary to perform subsequent steps to reduce the illumination effects and to set a common coordinate system that accounts for misalignments and slight pose and scale changes. Fig. 2 shows the chain of processing steps we used to detect and align faces in the photo collection. We used state-of-the-art methods to perform each of these steps. Faces were detected via the Viola–Jones face detector [33]. Face co-alignment and scaling were achieved by minimizing the entropy of each column of pixels through the whole set with the method in [34].


Fig. 2. Sequence of face processing steps.


The co-aligned faces were scaled and cropped to 100×100; we experimentally found that the information necessary to perform recognition with a reasonable precision for the goal of the present paper is still kept at this resolution. Using a larger scale for the faces did not significantly increase the performance of the system. To reduce the effect of illumination changes, we applied the method by Tan and Triggs [35] and used their code implementation.

4.2. Face description and matching

Principal component analysis (PCA) and Fisher linear discriminant (FLD) based descriptors [36,37] are commonly used to represent faces. However, such descriptors resulted in poor performance in our experiments, being sensitive to pose changes. In [38], an LBP-based descriptor was used for face recognition. Local Binary Patterns (LBP) [39] is an operator invariant to monotonic gray-level changes and computationally efficient. This operator is a non-parametric kernel which summarizes the local spatial structure of an image. At a given pixel position, LBP is defined as an ordered set of binary comparisons of pixel intensities between the center pixel and its eight surrounding pixels. In [38], the histograms of the extracted binary patterns within small regions (sub-windows) are concatenated to form a global descriptor of a generic image. When computing each histogram, only the 59 different uniform patterns are considered. Depending on the size of the sub-windows, this descriptor provides different recognition performances, and the size of the descriptor increases when the region size decreases. For example, for a face of size 100×100 and a 10×10 sub-window, this descriptor provides good recognition performance but the descriptor is a 5900-dimensional vector, posing challenges for the computational and space complexity of the method.
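The block-based LBP description can be sketched as follows, using plain 256-bin histograms per cell rather than the 59-bin uniform-pattern mapping used in the paper (the mapping only changes the final binning).

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbor LBP: compare each interior pixel to its 8 neighbors and
    pack the comparisons into an 8-bit code (a sketch of the operator)."""
    c = img[1:-1, 1:-1]
    # the eight shifted views, each aligned with the center pixels in c
    neighbors = [img[0:-2, 0:-2], img[0:-2, 1:-1], img[0:-2, 2:],
                 img[1:-1, 2:],   img[2:, 2:],     img[2:, 1:-1],
                 img[2:, 0:-2],   img[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbors):
        code |= (n >= c).astype(np.uint8) << bit
    return code

def lbp_histogram(img, grid=10):
    """Concatenate per-cell LBP histograms over a grid of sub-windows, as in
    the block-based description of [38,44] (256 bins per cell here)."""
    codes = lbp_codes(img)
    h, w = codes.shape
    hists = []
    for i in range(grid):
        for j in range(grid):
            cell = codes[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            hists.append(np.bincount(cell.ravel(), minlength=256))
    return np.concatenate(hists)
```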

However, LBP descriptors have the very attractive property of sparseness, which makes it possible to apply compression, e.g., PCA [40] or Non-negative Matrix Factorization (NMF) [41], with very limited loss of information. In our framework, we applied NMF; this method relies on projecting the descriptors into a space where they are represented by lower-dimensional vectors. The matrix of data is represented as the product of two non-negative matrices W and H, where the first matrix represents the basis vectors, while the second one represents the projection of the data in the new space represented by W. Since relatively few basis vectors are used to represent many data vectors, a good approximation can only be achieved if the basis vectors discover structure that is latent in the data. NMF has already been used to perform face recognition. Li and Zhang [42] showed that, when the dataset consists of a collection of face images, the representation consists of basis vectors encoding the mouth, nose, eyes, etc., that is, the most intuitive features of face images. However, we are not working in the face space, but are applying NMF to the concatenated histograms of LBP computed in sub-regions of the faces; the basis vectors represent the latent structure of such descriptors, while the projections represent how the latent components contribute to the whole face description.
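For concreteness, a didactic sketch of NMF with Lee-Seung multiplicative updates, following the convention in the text (V holds one descriptor per column, W the basis vectors, H the projections). In practice an off-the-shelf NMF solver would be used; this minimal version only illustrates the factorization.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V ≈ W @ H (Frobenius objective) with
    Lee-Seung multiplicative updates. Columns of W are the basis vectors,
    columns of H the low-dimensional projections of the descriptors."""
    rng = np.random.default_rng(seed)
    dim, n = V.shape
    W = rng.random((dim, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update projections
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H
```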

Inspired by Bosch et al. [43], where a pyramidal histogram of gradients was presented, we considered the same spatial schema to construct a descriptor able to capture also the spatial distribution of the LBP. The coarsest level proved robust to slight pose changes but had lower recognition power. At the finest level, the descriptor enforces correspondences between the smallest regions of the faces, and showed a higher recognition power but limited robustness to pose changes. By tiling the face into regions at multiple resolutions, the resulting pyramidal descriptor is expected to capture both features within local regions and their global spatial layout. Our experiments confirmed how well such a descriptor performs in practice. In [44], the face image is divided into local regions and LBP descriptors are extracted from each region independently and concatenated to form a global description of the face. From this point of view, our LBP descriptor is similar to the method presented in [44], while, at each pyramid level, we compute a concatenated histogram and perform compression independently from the other pyramid levels.

To find the correspondence between the descriptors of two faces F_k and F_j, it is possible to use a kernel function defined as [43]:

K(F_k, F_j) = \sum_{i=1}^{L} \alpha_i \cdot g(F_k^i, F_j^i)    (1)

where \alpha_i is the weight at level i, g is the similarity function, and F_k^i and F_j^i are the face representations at the i-th pyramid level. As suggested in [45], each level has been weighted using \alpha_i = \frac{1}{2^{(L-i)}}, where L is the number of levels and i is the current level.

As the similarity function g, we used the inner product between the two descriptors at the i-th pyramid level. Such a function performed well in our experiments, but other functions could be used instead without significant performance degradation.

We note that, considering the properties of the inner product, the whole kernel function in Eq. (1) can be implemented as a simple inner product of the concatenated histograms computed at each resolution. To take into account the parameters \alpha_i, it is necessary to multiply each histogram by the square root of the corresponding coefficient \alpha_i. We stress, however, that this property holds for the inner product kernel function but not in general: for other similarity functions g, the more general definition in Eq. (1) has to be adopted. Fig. 3 shows all the steps required to implement the face descriptor using three different granularities.
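The equivalence noted above is easy to verify numerically. The sketch below compares Eq. (1), with g as the inner product, against a single inner product of the concatenated descriptors scaled by the square roots of the \alpha_i (the per-level sizes are arbitrary).

```python
import numpy as np

def pyramid_kernel(Fk, Fj, alphas):
    """Eq. (1) with g = inner product: K(Fk, Fj) = sum_i alpha_i * <Fk^i, Fj^i>.
    Fk and Fj are lists of per-level descriptor vectors."""
    return sum(a * float(fk @ fj) for a, fk, fj in zip(alphas, Fk, Fj))

def weighted_concat(F, alphas):
    """Scale each level by sqrt(alpha_i) and concatenate: the kernel then
    reduces to a single inner product of the concatenated descriptors."""
    return np.concatenate([np.sqrt(a) * f for a, f in zip(alphas, F)])

L = 3
alphas = [1.0 / 2 ** (L - i) for i in range(1, L + 1)]  # alpha_i = 1 / 2^(L-i)
```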

4.3. Identity model

The main idea behind our framework is that, at the same time, we find associations among faces enforcing the mutual exclusivity constraint in each photo, and we learn how to discriminate among identities considering both the new positive and negative cases.

In a general on-line setting, the system receives observations in a sequential manner and predicts an outcome for each one. In the case of


Fig. 3. Computing the pyramidal face descriptor: the final descriptor is obtained by concatenating the LBP descriptors at different granularities after dimensionality reduction (NMF) has been applied.


a binary classifier, this outcome is simply a yes/no decision (represented by a label in {+1, −1} associated to each instance). After the prediction has been made, the algorithm receives feedback indicating the correct outcome. Then, the on-line algorithm modifies its prediction mechanism trying to improve the prediction accuracy on subsequent rounds. In our setting, we used a one-vs-all linear classifier for each identity to compute the probability of a match between a face and the corresponding identity. Our method does not receive any feedback from the user and enforces the mutual exclusivity constraint to compute the most probable set of associations. The inferred associations and the observations are then used to update all the classifiers on-line: associated faces are considered as positive instances, while the others are negative ones.

4.3.1. Learning on-line the identities

Each linear classifier is defined as a function of the parameter vector w and, for each sample xt, it predicts a label y∗ computed as y∗ = sign(w · xt); the magnitude |w · xt| is the margin associated to the instance xt. In our framework, we need to update the classifiers on-line while new labels are estimated. As the labels we get are uncertain, we need a method that takes into account the new label but does not forget the previous ones. We adopted the very simple schema proposed in [46], where on-line passive–aggressive (PA) classifiers were first presented. In the PA algorithm, the update is performed by solving a constrained optimization problem that tries to keep the classifier as close as possible to the current solution – so as not to forget its past experience – while achieving at least a unit margin on the new instance. If the margin is less than 1, the classifier suffers an instantaneous loss, defined by the hinge-loss function:

$$\mathrm{Loss}(\mathbf{w}; (\mathbf{x}, y)) = \begin{cases} 0 & \text{if } y\,(\mathbf{w}^{T}\mathbf{x}) \ge 1 \\ 1 - y\,(\mathbf{w}^{T}\mathbf{x}) & \text{otherwise} \end{cases} \qquad (2)$$

Training the on-line PA classifier requires setting, on round t, the new weight vector wt+1 to be the solution of the following constrained optimization problem:

$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \; \tfrac{1}{2}\,\|\mathbf{w} - \mathbf{w}_t\|^{2} \quad \text{s.t.} \quad \mathrm{Loss}(\mathbf{w}; (\mathbf{x}_t, y_t)) = 0. \qquad (3)$$

The PA algorithm is passive whenever the hinge-loss is zero, in which case wt+1 = wt; it is aggressive because, whenever the hinge-loss is positive, the algorithm forces wt+1 to satisfy the constraint Loss(wt+1; (xt, yt)) = 0, that is, it forces the classifier to have a margin of at least 1 on the new sample. This problem has a simple closed-form solution, and the iterative schema for optimizing the parameters is [46]:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \tau_t\, y_t\, \mathbf{x}_t \qquad (4)$$

where

$$\tau_t = \max\!\left(0,\; \frac{\mathrm{Loss}(\mathbf{w}_t; (\mathbf{x}_t, y_t))}{\|\mathbf{x}_t\|^{2}}\right). \qquad (5)$$
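The update of Eqs. (2), (4) and (5) can be sketched as follows in its linear, primal form (function names are ours, not from [46]):

```python
def hinge_loss(w, x, y):
    # Eq. (2): zero once the signed margin reaches 1, linear penalty otherwise.
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

def pa_update(w, x, y):
    # Eqs. (4)-(5): w_{t+1} = w_t + tau * y * x with tau = loss / ||x||^2.
    loss = hinge_loss(w, x, y)
    if loss == 0.0:
        return list(w)          # passive step: nothing to correct
    tau = loss / sum(xi * xi for xi in x)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

After an aggressive step, the updated classifier attains a margin of exactly 1 on the new sample, so a repeated update on the same instance is passive.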

In the dual space, the predicted label at round t+1, considering a kernel function K(⋅,⋅) (see Eq. (1)), is:

$$y^{*} = \operatorname{sign}\left(\mathbf{w}_t \cdot \mathbf{x}_{t+1}\right) \qquad (6)$$

$$\;\; = \operatorname{sign}\!\left(\sum_{z=1}^{t} \tau_z\, y_z\, K(\mathbf{x}_z, \mathbf{x}_{t+1})\right). \qquad (7)$$

For each identity i we use the faces detected in each photo to update the classifier in the dual space and learn the corresponding parameters τi. Then, we can use Eq. (7) to classify each new incoming face instance.
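A dual-form classifier in the spirit of Eq. (7) can be sketched as below; the class name and structure are illustrative, and the model is simply the list of support triples (τz, yz, xz) collected whenever the hinge loss is positive:

```python
class DualPAClassifier:
    """Kernel (dual-space) PA sketch: prediction follows Eq. (7)."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.support = []        # (tau_z, y_z, x_z) triples

    def score(self, x):
        # The dual expansion sum_z tau_z * y_z * K(x_z, x).
        return sum(t * y * self.kernel(xz, x) for t, y, xz in self.support)

    def update(self, x, y):
        # Aggressive step only when the margin is below 1; in feature
        # space ||x||^2 becomes K(x, x).
        loss = max(0.0, 1.0 - y * self.score(x))
        if loss > 0.0:
            self.support.append((loss / self.kernel(x, x), y, x))

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1
```

Passing a plain inner product as `kernel` recovers the linear classifier; passing the pyramidal kernel of Eq. (1) gives the descriptor-aware variant.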

When updating the classifiers, we need to consider that the training set is unbalanced. The problem of training a classifier with an unbalanced training set, where the positive class is under-represented compared to the negative one, is treated in [47]. The authors showed that it is possible to improve the performance by applying a one-sided selection, which filters samples only from the negative class, using, for each positive instance, the nearest negative samples. However, in our framework, labels are uncertain and are collected incrementally. Hence, to implement an instance filtering strategy, we reason about the confidence of the classifier on each instance. Therefore, we update each classifier considering all the positive instances and only the negative instances on which the classifier showed to be more confused, that is, only the negative instances for which p(y∗=−1|xt, ci) is lower than a fixed threshold PTh. This is an alternative way to consider the negative samples that are near to the positive class.
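The filtering rule can be sketched as follows; scores and the threshold value are hypothetical, and p(y∗=−1|x) is taken as one minus the sigmoid of the margin (see Eq. (8) later in the text):

```python
import math

def p_negative(score):
    # Posterior p(y* = -1 | x, c_i) as one minus the sigmoid of the margin.
    return 1.0 / (1.0 + math.exp(score))

def filter_negatives(scored_faces, p_th=0.7):
    """scored_faces: (margin score, label) pairs, label in {+1, -1}.
    Keep every positive instance; keep a negative instance only when the
    classifier is still confused about it, i.e. p(y* = -1 | x) < p_th."""
    return [(s, y) for s, y in scored_faces
            if y == 1 or p_negative(s) < p_th]
```

A negative face on which the classifier is already confidently negative is discarded, so the update concentrates on the negatives closest to the positive class.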

4.3.2. Initializing the identities

The on-line PA schema poses some challenges concerning the initialization step. In our setting, the initial pool of labels for each identity can be easily obtained considering the mutual exclusivity constraint. This set of faces is in general unbalanced: there is one positive sample and several negative samples. This proved to be very challenging, particularly in the initialization step, when a classifier is added because a new identity has been discovered. Our experiments showed that initializing the parameters via the on-line PA algorithm when too many negative instances are present generally confuses the classifier due to overfitting. On the other hand, this initial set is composed of certain labels and gives a first rough partition of the feature space. To initialize the classifiers while avoiding the loss of meaningful information, the initialization step is performed off-line by solving a quadratic optimization problem that minimizes the norm of the parameters τ, constrained by the fact that the loss needs to be 0 and the parameters need to be non-negative. In this way we guarantee that the classifier is able to correctly discriminate among the faces in the first photo where the associated identity has been discovered. In the case of photos where only one face has been detected, if this face is a new identity, then we need to provide a set of negative samples to train the classifier. We select the k faces farthest from it among those previously processed. As the similarity measure to compute such faces, we used Eq. (1).
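Selecting the k farthest previously processed faces as initial negatives can be sketched as below; the similarity function stands in for the kernel of Eq. (1), and the names are ours:

```python
def k_farthest_negatives(new_face, processed_faces, similarity, k=3):
    """Return the k previously processed faces least similar to the new
    face; these act as the initial negative samples for the classifier
    of a newly discovered identity."""
    ranked = sorted(processed_faces, key=lambda f: similarity(new_face, f))
    return ranked[:k]
```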

4.3.3. Estimating the probability of a match

In our framework we are not interested in the predicted identity; instead, we need the probability that the classified face matches the identity. We thus need to transform the classifier's outcome into the posterior probability of a match. We applied the method proposed in [48] and used a sigmoid function to compute such probability. Given a classifier ci corresponding to the i-th identity, the following equation:

$$p(y^{*}=1 \mid \mathbf{x}_{t+1}, c_i) = \frac{1}{1 + \exp\!\left(-\mathbf{w}^{i}_{t} \cdot \mathbf{x}_{t+1}\right)} \qquad (8)$$

$$\;\; = \frac{1}{1 + \exp\!\left(-\sum_{z=1}^{t} \tau^{i}_{z}\, y^{i}_{z}\, K(\mathbf{x}_z, \mathbf{x}_{t+1})\right)} \qquad (9)$$

provides the probability that the face xt+1 matches the i-th identity. A classifier like the one we use corresponds to a logistic regressor. The common way to train a logistic regressor is by maximizing the posterior of the whole set of data [49]. Here, we are instead adopting the PA schema; hence, as also detailed in Section 4.3.1, the logistic regressor is trained by forcing the margin on new instances to be 1 as often as possible.

4.4. Face classification and identity association

Assuming that N identities were already discovered and there are P faces to classify in the current photo, our goal is to find the best associations between such P faces and the N identities enforcing the mutual exclusivity constraint. If no match is found for a face, a new identity is discovered and a new classifier is initialized and added.

We define C = {cu}, u = 1, …, N, as the set of trained classifiers, and Vi,j as the random binary variable indicating that the j-th face Fj matches the i-th identity, given the face and the classifier ci. The corresponding probability can be computed using Eq. (9).

Let xi,j be a binary variable that takes value 1 if Fj is associated to identity i. The probability of such an association can be defined in terms of the set of binary variables {Vu,j}, u = 1, …, N, where Vi,j = 1 and Vk,j = 0 for every k ≠ i. Assuming that the variables {Vu,j} are independent, then

$$p(x_{i,j}=1 \mid F_j, C) = p(V_{i,j}=1 \mid F_j, c_i) \cdot \prod_{k \neq i} p(V_{k,j}=0 \mid F_j, c_k). \qquad (10)$$

Similarly, we define the binary variable xN+1,j that takes value 1 if Fj corresponds to a new identity (that is, it does not match any known identity). Such probability can be modeled by considering the set of binary variables {Vk,j}, k = 1, …, N, where Vk,j = 0 for every k, that is:

$$p(x_{N+1,j}=1 \mid F_j, C) = \prod_{k=1}^{N} p(V_{k,j}=0 \mid F_j, c_k). \qquad (11)$$
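Eqs. (10) and (11) amount to simple products over the per-classifier match probabilities; a minimal sketch, assuming the independence stated in the text (the function name is ours):

```python
def association_probs(match_probs):
    """match_probs[i] = p(V_i = 1 | face, c_i) for the N known identities.
    Returns the N association probabilities of Eq. (10), plus, as last
    element, the new-identity probability of Eq. (11)."""
    n = len(match_probs)
    probs = []
    for i in range(n):
        p = match_probs[i]
        for k in range(n):
            if k != i:
                p *= 1.0 - match_probs[k]   # all other identities mismatch
        probs.append(p)
    p_new = 1.0
    for q in match_probs:                    # Eq. (11): matches nobody
        p_new *= 1.0 - q
    probs.append(p_new)
    return probs
```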

The best set of associations between faces and identities can be found by maximizing the association joint probability J (defined later in Eq. (12)), considering that:

• each face can be assigned to only one identity or, possibly, can be a new identity;

• each identity can be associated to at most one face, as at each step we process a set of faces from the same photo.

Optimal associations can be found by solving a constrained optimization problem. We have to maximize the association joint probability J, that is, the product of the above-defined probabilities (or, to avoid numerical instability, the sum of their logarithms) with respect to the (N+1)×P binary variables xi,j. The function J(x) to maximize is:

$$J(\mathbf{x}) = \sum_{i=1}^{N+1} \sum_{j=1}^{P} \log p(x_{i,j}=1 \mid F_j, C) \cdot x_{i,j} \qquad (12)$$

constrained by:

$$\sum_{j=1}^{P} x_{i,j} \le 1 \quad \forall\, i \in [1, N] \qquad (13)$$

$$\sum_{i=1}^{N+1} x_{i,j} = 1 \quad \forall\, j. \qquad (14)$$

The first constraint implies that each identity is associated to at most one face. However, no constraint is imposed on the variables xN+1,j, as it is possible to find more than one new identity in each photo. The second constraint implies that each face is associated either to exactly one known identity or to a new identity.

Since the variables xi,j are binary, this is a binary integer program, which we solved with a linear programming (LP)-based branch-and-bound algorithm [50]. Based on the resulting optimal associations, we assign to each face a label corresponding to the associated identity and use the labels to update the classifiers or to add new identities.
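For small N and P, the constrained maximization of Eq. (12) can be checked with a brute-force stand-in for the LP-based branch-and-bound of [50] (illustrative only; it is exponential in the number of faces):

```python
import itertools

def best_assignment(log_probs):
    """log_probs[j][i] = log p(x_{i,j} = 1 | F_j, C) for face j and
    identity i; the last column (i == N) stands for 'new identity'.
    Exhaustively maximizes J(x) of Eq. (12) under the mutual
    exclusivity constraints (13)-(14)."""
    P = len(log_probs)
    N = len(log_probs[0]) - 1
    best, best_score = None, float('-inf')
    for choice in itertools.product(range(N + 1), repeat=P):
        known = [c for c in choice if c < N]
        if len(known) != len(set(known)):
            continue                 # constraint (13): identity used twice
        score = sum(log_probs[j][c] for j, c in enumerate(choice))
        if score > best_score:
            best, best_score = list(choice), score
    return best
```

Constraint (14) is satisfied by construction, since every face picks exactly one column; the "new identity" column may be picked by any number of faces.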

5. Experimental results

We performed our experiments on a publicly available dataset [15] and on a photo sequence generated from the dataset [51]. We compared the results of our approach to K-means and to hierarchical clustering. These two methods are the most used in this kind of application, e.g. [3,4], but present some limitations when adopted to solve our problem, as explained in Section 3. Post-processing steps accounting for similarity in clothing or time segmentation could be adopted in our system to increase its accuracy. However, here we are interested in showing how our semi-supervised approach performs using no other information but face descriptions, and we report performance against clustering methods.

In the following, we first describe the datasets used in this paper, and then we motivate, by means of experiments, the use of a pyramidal approach for face description. In Section 5.3, we describe how we measure the performance of the method and of the clustering techniques. In Section 5.4, we report results of our semi-supervised face association, while in Section 5.5 we consider a supervised face association approach. We trained a set of classifiers off-line and compared two strategies: associating the most probable identity to each face, and enforcing the mutual exclusivity constraint.

5.1. Dataset

The public Gallagher dataset [15] is composed of 589 photos taken on different days over a period of about 6 months. From the ground truth, we know that 32 identities are present. While we noted that some faces were not annotated at all in the dataset, we used the same ground truth provided along with the dataset. Such ground truth annotates each face by means of the eye positions. We applied the face detector [33] on the dataset and assigned to each detected face the identity from the ground truth corresponding to the person whose eyes lie inside the detected face itself. Our face detector found 1040 facial images, but only 813 faces had an associated identity. The remaining detected faces were treated as false positives.

Finding large publicly available datasets is challenging. Instead of presenting results on private collections, we generated a "virtual" photo sequence. We used the public dataset "Faces in the Wild" (LFW) [51] as a source of faces. This dataset contains 5749 identities and 13,233 already aligned faces. We used it to automatically generate a virtual photo sequence: in practice, a virtual photo is a set of real faces. We generated 500 virtual photos composed of 1593 different faces. To generate these photos, we selected a set of 30 identities randomly sampled from the identities that had more than 15 associated faces. For each virtual photo, we randomly selected from the pool of selected faces a number of faces ranging from 1 to 15.

5.2. Benefits of using a pyramidal approach for face description

To test our face descriptor, we used a subset of 55 different identities from the LFW dataset. For each identity we randomly selected 15 face images; the dataset generated in this way was composed of 825 face images. We measured the true positive rate vs. the false positive rate for 100 randomly selected images. The remaining 725 images were used to train a set of 55 logistic regressor classifiers, and the probabilities they provided were thresholded to obtain ROC curves.

Fig. 4. Performance applying NMF to the LBP descriptor computed in sub-windows of size 10×10, 20×20 and 100×100 pixels. (Three ROC panels, true positive rate vs. false positive rate, one per granularity, with one curve per number of NMF components.)

To extract features from the face image, we computed the LBP histograms in sub-windows of sizes 10×10, 20×20 and 100×100 (that is, the whole image).
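For illustration, a sub-window histogram of the basic 256-bin LBP operator can be computed as follows; note that the paper's 59-bin descriptor uses the uniform-pattern variant, so this is a simplified sketch with names of our choosing:

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code for pixel (r, c) of a 2-D list of
    grey values: one bit per neighbour not darker than the centre."""
    center = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img, top, left, size):
    """256-bin LBP histogram of a size x size sub-window; image border
    pixels are skipped since they lack a full 8-neighbourhood."""
    hist = [0] * 256
    rows, cols = len(img), len(img[0])
    for r in range(max(top, 1), min(top + size, rows - 1)):
        for c in range(max(left, 1), min(left + size, cols - 1)):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

Concatenating such histograms over all sub-windows of one size yields the descriptor at that granularity, before dimensionality reduction.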

In this work we use a pyramidal approach to represent the LBP descriptors at different granularities. The length of the descriptor at each pyramidal level is 5900, 1475 and 59, respectively, which makes it impractical to learn classifiers on unprocessed data. We therefore applied dimensionality reduction at each of these granularities. Applying dimensionality reduction drastically reduces the time required for training the classifiers and increases performance, the model to learn being simpler. Popular methods for dimensionality reduction are principal component analysis (PCA) and non-negative matrix factorization (NMF).

We applied NMF at each level of the pyramid. First, we performed preliminary experiments to choose the number of components to use at each level. Fig. 4 shows a comparison of the average performance on 100 randomly generated partitions for descriptors of length L – after dimensionality reduction – with L ∈ {10, 20, ⋯, 100}.

As the graph shows, performance for the first two granularities is generally very similar for 40 or more components. For L greater than 40, the true positive rate increases only slightly. We used 100, 50 and 20 components for windows of sizes 10×10, 20×20 and 100×100, respectively, considering the highest true positive rate at zero false positive rate. Fig. 5 reports the average performance of each single granularity and of our pyramidal approach, and it clearly shows that the pyramidal approach improves the recognition power of the descriptor.

In previous works [52–54], LBP has been used jointly with PCA to perform dimensionality reduction. However, the main problem is finding a good number of components for the descriptor. The number of components used in previous papers ranges between 100 and 200. We also evaluated a PCA approach for dimensionality reduction with the same numbers of components used for the NMF, i.e. 100, 50 and 20. Fig. 6 plots the performance of each granularity, showing that PCA has performance comparable to NMF at low false positive rates.

5.3. Performance measures and comparison

To evaluate the performance of our algorithm, we associated to each person-cluster the predominant identity obtained from the ground truth. We measured the misclassification rate as the number of misclassified faces in the whole photo sequence over the total number of detected faces.

Fig. 5. Performance of LBP+NMF. Curves were computed at different granularities and for the pyramidal approach. The number of components when using NMF was 100, 50 and 20, respectively, for sub-windows of sizes 10×10, 20×20 and 100×100. We considered two different kinds of pyramid: a 3-level pyramid with all the granularities, and a 2-level pyramid that considers only the granularities corresponding to sub-windows of size 10×10 and 20×20. The figure shows that the third level actually slightly increases the performance of the method.

Table 1
Misclassification rates (%) on the Gallagher dataset. The lowest misclassification rate has been highlighted.

Methods          No FPs, PTh=0.7   No FPs, PTh=1   With FPs, PTh=0.7   With FPs, PTh=1
K-means TR=1     34.1±1.35         34.1±1.35       33.81±1.13          33.81±1.13
H. Clust TR=1    38.49             38.49           38.26               38.26
K-means          32.86±1.12        31.64±1.11      31.27±0.72          30.81±0.83
H. Clust         36.53             33.33           33.4                33.1
Ours             27.68             31.26           28.66               30.75
TR               1.22              1.56            1.62                1.7

To measure how false positives in face detection (FPs) affect the performance, we computed the misclassification rate with and without FPs. In the latter case, FPs are identified by means of the ground truth and removed from the set of detected faces.

To evaluate the capability of the method to automatically compute the correct number of persons in the photo collection, we compute the track ratio (TR), that is, the ratio between the number of detected identities and the true number of identities in the collection. As the misclassification rate is computed relative to the predominant identity of each cluster, it tends to decrease in case of over-clustering (in the limit case, if each cluster contains only one face, the misclassification rate is 0%). To make a fair comparison, we also computed the misclassification rate for the other two clustering methods using a number of clusters equal to the number of identities automatically found by our technique. In this way, the misclassification rate is measured on equal terms of track ratio. Finally, as K-means is randomly initialized, we present the average performance over 100 runs and report the standard deviation of the measured misclassification rate.
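The two measures can be sketched as follows (the cluster representation and function names are ours):

```python
from collections import Counter

def misclassification_rate(clusters):
    """clusters: one list of ground-truth identity labels per discovered
    person-cluster. Each cluster is assigned its predominant identity;
    every other face in the cluster counts as a misclassification."""
    errors, total = 0, 0
    for labels in clusters:
        if not labels:
            continue
        predominant_count = Counter(labels).most_common(1)[0][1]
        errors += len(labels) - predominant_count
        total += len(labels)
    return errors / total

def track_ratio(clusters, true_identity_count):
    # Number of detected identities over the true number of identities.
    return len(clusters) / true_identity_count
```

The degenerate behaviour noted in the text is visible here: with one face per cluster, every cluster is pure and the rate is 0, which is why we compare methods at equal track ratio.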

Fig. 6. Performance of LBP+PCA. Curves were computed at different granularities with 100, 50 and 20 components for sub-windows of size 10×10, 20×20 and 100×100, respectively.

5.4. Semi-supervised face association

We now present results of our semi-supervised approach on the Gallagher dataset and on a virtual photo sequence generated from the Faces in the Wild dataset.

5.4.1. Results on the Gallagher dataset

Table 1 summarizes the results obtained with our method and with clustering techniques on the Gallagher dataset. As the table shows, our method discovers a reasonable number of identities and generally outperforms clustering techniques on equal terms of track ratio. We stress that, contrary to clustering techniques, our method does not require prior knowledge about the number of clusters to compute. Not only is our method able to discriminate among several identities, it also automatically discovers new identities appearing in the photo collection, as described in Section 4.4. The table shows results obtained setting the probability threshold PTh to 0.7 and to 1 (see Section 4.3.1); in the latter case, in practice, we consider all the negative instances when updating the classifiers. The experiments show that considering all the negative instances contributes to confusing the classifiers, while instance filtering improves the classification results.

Looking more closely at the clusters, we observed that K-means found clusters of roughly the same size, and several clusters were assigned to the most numerous identities. In contrast, our approach is able to find clusters associated to rarer identities. For all the methods, some clusters show a higher confusion, due to strong pose changes.

In Table 2, we report the results on the Gallagher dataset when using the pyramidal descriptor and a single granularity (we used LBP computed on 10×10 sub-windows). We also report results for the method in [18], where faces were described by eigenfaces. The results show that the pyramidal approach increases the discrimination capability of the LBP descriptor and that, even for the single granularity, the track-ratio value remains limited. This suggests that the framework we propose in this work is more robust than the one presented in [18], the track ratio being reduced.

We evaluated the impact of the hyper-parameter k introduced in Section 4.3.2 by performing experiments for k ∈ [1, 10]. In practice, k controls the number of negative samples selected to initialize a classifier. We noted that the performance of the method and the track ratio are not significantly affected by the value of k. This is probably due to the fact that, when selecting such faces, we take the k farthest from the positive case, so they will likely have a high margin, being very different from the positive class. The parameter, however, affects the computational time required to process such faces. In our experiments we set k to 3.

Table 2
Comparison on the Gallagher dataset. The lowest misclassification rate has been highlighted.

Methods                     No FPs    With FPs
Method in [18]              30.1      33.85
TR [18]                     9.5       11.37
Ours (pyramid)              27.68     28.66
TR (ours, pyramid)          1.22      1.62
Ours (single gran.)         33.58     31.24
TR (ours, single gran.)     1.09      1.84

Fig. 7. Each row in the image represents faces belonging to the same cluster computed with our method. On the left, we report the ID that was assigned to the cluster when computing performance. Red-bordered faces correspond to wrong associations.

Fig. 7 shows some faces extracted from the clusters obtained with our method. Red-bordered faces correspond to misclassified faces.

5.4.2. Results on the Faces in the Wild dataset

Table 3 shows the results on the virtual photo sequence. On this dataset, our method discovered many small clusters; this is why the track ratio increased. Analyzing the results, we visually observed that some faces with strong pose changes, or partially occluded by objects, were not correctly associated and were instead used to initiate a new classifier. This could bias the results, because over-clustering tends to reduce the misclassification rate. We therefore computed the misclassification rate only for those clusters with more than N faces. As stated in Section 5.1, the generated dataset is composed of 1593 faces of 30 different individuals. We observed that, of the 178 identities found, only 68 associated clusters had more than 3 faces (1428 faces in total), only 35 had more than 10 faces (1210 faces) and only 12 had more than 30 faces (780 faces). The misclassification rates for only these clusters were 23.88%, 20.41% and 13.08%, respectively. These measures show that our method is able to construct populated and consistent face groups, and that much of the over-clustering is due to small clusters grouping faces difficult to associate to other identities. This suggests that our framework handles outliers by adding new identities without confusing the already estimated identity models.

Table 3
Misclassification rates (%) on the virtual photo sequence. The lowest misclassification rate has been highlighted.

Methods          PTh=0.7      PTh=1
K-means TR=1     36.28±0.86   36.28±0.86
H. Clust TR=1    41.43        41.43
K-means          26.39±0.53   27.36±0.61
H. Clust.        24.39        27.98
Ours             22.15        25.21
TR               5.93         5.3

5.5. Comparison with supervised face association

Finally, we compare unsupervised methods (K-means, hierarchical clustering and our data association) and supervised methods. We randomly split the 500 generated virtual photos into two sets, only checking that all the identities were present in both sets. We used the set composed of 200 photos (with 630 faces) to train a logistic regressor for each identity by means of the Newton–Raphson method [49]. Then we used the remaining 300 photos (with 963 faces) as a test set.

Table 4
Unsupervised vs. supervised. The lowest misclassification rate has been highlighted.

Unsupervised methods    Mis-cl.
K-means TR=1            37.46±1.14
H. Clust. TR=1          38.5
K-means                 25.68±1.21
H. Clust                24.57
Ours                    22.43
TR                      5.16

Supervised methods      Mis-cl.
Log. Reg.+MS TR=1       20.46
Log. Reg.+ME TR=1       10.98

To cluster faces using supervised methods, we adopted two strategies: Maximum Score (MS) and Mutual Exclusivity (ME). MS consists in classifying one face at a time using the set of logistic regressors and associating each face to the identity providing the maximum probability of a match. ME, instead, consists in solving the data association problem formulated as explained in Section 4.4, but without the possibility of adding a new identity. Table 4 summarizes the results. Supervised methods perform better but, of course, they require many more user interactions to get the annotations necessary to train the classifiers. Moreover, using such methods, a strategy is required to account for the introduction of new identities. The experiments show that enforcing the mutual exclusivity constraint is a successful strategy; also in the case of supervised methods, it considerably reduces the misclassification rate.

5.6. Complexity of the proposed method

As regards time complexity, consider a worst-case sequence of k photos with N faces each, and N individuals in the whole sequence: we would have to cluster k·N faces. In the following, we compare the time complexity of the clustering methods against our approach. The complexity of face detection and representation is the same for all the methods and depends on the adopted face descriptor and dimensionality reduction technique.

It is well known that hierarchical clustering has a time complexity of O((k·N)² · log(k·N)) [55] while, considering M iterations, K-means has a time complexity of O(M·N·(k·N)), that is, O(M·k·N²) [55].

For each photo, our technique requires solving an association problem of complexity O(f(N)) and updating N identities with the N detected faces. It is well known that 0–1 integer programming is NP-complete, and the complexity depends on the adopted branch-and-bound approach. Therefore, in the worst case, the cost O(f(N)) is exponential; however, in general N ≪ k, and in practical cases the problem is tractable. Considering the cost of the data association problem and of the identity updating, the complexity of processing the whole set of k photos is O(k·(f(N)+N²)). The complexity being linear in k, the framework scales well with the size of the photo collection. As concerns the number of identities, the number of individuals depicted in a personal photo collection is assumed to be limited; therefore, the complexity is in general manageable.

As for the space complexity of each identity model, in the primal space it equals the length of the classifier parameter vector w. In the dual space, considering the worst case in which there are N identities appearing in each of the k photos, the model will be composed of all the face images detected in the photo sequence for which τ is not zero (that is, the margin is less than 1). In practice, the model would consist of k·N faces and the corresponding set of parameters τ. This case could arise when the classes overlap so much that a linear classifier cannot separate them. In real cases, however, the size of the model is much lower, because several faces are correctly classified with a margin higher than 1. In this case, as τ would be zero, the corresponding face is not included in the model; moreover, the instance filtering described in Section 4.3.1 further reduces the number of negative samples to include in the model.

We performed our experiments using a Matlab prototype on an Intel Core 2 Duo 2.53 GHz.

On the Gallagher dataset, K-means took on average approximately 6 s, while hierarchical clustering took about 25 s. Our method, instead, requires on average 80 s to group all the faces. On the virtual photo sequence, where we found a higher number of identities and there are almost twice as many faces to analyze, K-means took on average approximately 20 s, while hierarchical clustering took about 2 min. Our method, instead, requires on average 12 min.

Even though our approach is computationally more expensive than clustering techniques, we believe that the difference in computation time could actually be much smaller than reported. In fact, in our experiments, we used optimized implementations of K-means and hierarchical clustering, but a simple, unoptimized implementation of the proposed technique.

6. Conclusions and future works

In this paper we presented an automatic method that relies on on-line learning techniques to discriminate among faces by updating a set of PA classifiers. We showed how face grouping can be formulated as a semi-supervised learning problem that enforces the mutual exclusivity constraint; namely, our method associates faces among photos considering that each face can be associated to only one identity and, vice versa, each identity can be associated to at most one of the faces detected in the image. In our framework, each identity is modeled by means of an on-line trained classifier without any user input. Faces are processed iteratively and, whenever a new identity is discovered, a new classifier is trained. We tested our system on a publicly available dataset, and we also generated photo sequences for more extensive experiments. Our experiments showed that the method is able to cluster faces without any a priori information about the collection, contrary to other state-of-the-art works that require the user to specify the number of groups to build and/or a set of labels to use for tagging. In future work, we will extend our approach to consider other cues to find associations along the photo sequence, such as clothing information and time/event segmentation. We will also investigate the use of Multiple Instance Learning [7] to learn on-line to discriminate between faces.

Acknowledgments

We thank all the anonymous reviewers and the associate editor, whose insightful comments led to significant improvements of the manuscript.

References

[1] iPhoto, http://www.apple.com/ilife/iphoto, 2002.

[2] R. Jain, P. Sinha, Content without context is meaningless, Proc. of Conf. on Multimedia (MM), ACM, 2010, pp. 1259–1268.

[3] J. Choi, S. Yang, Y. Ro, K. Plataniotis, Face annotation for personal photos using context-assisted face recognition, Proc. of Int. Conf. on Multimedia Information Retrieval (MIR), ACM, 2008, pp. 44–51.

[4] W.-T. Chu, Y.-L. Lee, J.-Y. Yu, Using context information and local feature points in face clustering for consumer photos, Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2009, pp. 1141–1144.

[5] E. Ardizzone, M. La Cascia, F. Vella, A novel approach to personal photo album representation and management, in: Proceedings of Multimedia Content Access: Algorithms and Systems II, IS&T SPIE Symposium on Electronic Imaging, volume 6820, 2008, p. 682007.

[6] L. Zhang, Y. Hu, M. Li, W. Ma, H. Zhang, Efficient propagation for face annotation in family albums, Proc. of Conf. on Multimedia (MM), ACM, 2004, pp. 716–723.

[7] B. Babenko, M. Yang, S. Belongie, Visual tracking with online multiple instance learning, Proc. of Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 983–990.

[8] H. Lu, W. Zhang, Y. Chen, On feature combination and multiple kernel learning for object tracking, Computer Vision—ACCV 2010, Springer, 2011, pp. 511–522.

[9] F. Tang, S. Brennan, Q. Zhao, H. Tao, Co-tracking using semi-supervised support vector machines, Proc. of Int. Conf. on Computer Vision (ICCV), IEEE, 2007, pp. 1–8.

[10] T. Kim, W.T.B. Stenger, R. Cipolla, Online multiple classifier boosting for object tracking, Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30, 2008, pp. 1257–1269.

[11] L. Zhang, L. Chen, M. Li, H. Zhang, Automated annotation of human faces in family albums, Proc. of Conf. on Multimedia (MM), ACM, 2003, pp. 355–358.

[12] H. Kang, B. Shneiderman, Visualization methods for personal photo collections: browsing and searching in the PhotoFinder, Int. Conf. on Multimedia and Expo (ICME), volume 3, IEEE, 2000, pp. 1539–1542.

[13] Picasa, http://picasa.google.com, 2004.

[14] L. Lo Presti, M. Morana, M. La Cascia, A data association approach to detect and organize people in personal photo collections, Multimedia Tools and Applications, 2011.


316 L. Lo Presti, M. La Cascia / Image and Vision Computing 30 (2012) 306–316

[15] A. Gallagher, T. Chen, Clothing cosegmentation for recognizing people, Proc. of Computer Vision and Pattern Recognition (CVPR), IEEE, 2008, pp. 1–8.

[16] A. Kapoor, G. Hua, A. Akbarzadeh, S. Baker, Which faces to tag: adding prior constraints into active learning, Proc. of Int. Conf. on Computer Vision, IEEE, 2009, pp. 1058–1065.

[17] L. Gu, T. Zhang, X. Ding, Clustering consumer photos based on face recognition, Proc. of Int. Conf. on Multimedia and Expo, IEEE, 2007, pp. 1998–2001.

[18] L. Lo Presti, M. Morana, M. La Cascia, A data association algorithm for people re-identification in photo sequences, Proc. of Int. Symposium on Multimedia (ISM), IEEE, 2010, pp. 318–323.

[19] Y. Bar-Shalom, Tracking and Data Association, Academic Press Professional, Inc., 1987.

[20] J. Gross, J. Yellen, Graph Theory and Its Applications, CRC Press, 2006.

[21] J. Munkres, Algorithms for the assignment and transportation problems, J. Soc. Ind. Appl. Math. (1957) 32–38.

[22] X. Zhu, Semi-Supervised Learning Literature Survey, Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.

[23] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, Proc. of Conf. on Computational Learning Theory, ACM, 1998, pp. 92–100.

[24] T. Mitchell, The role of unlabeled data in supervised learning, in: Proc. of Int. Colloquium on Cognitive Science, 1999, pp. 2–11.

[25] L. Wang, C. Leckie, K. Ramamohanarao, J. Bezdek, Automatically determining the number of clusters in unlabeled data sets, Trans. Knowl. Data Eng. (2008) 335–350.

[26] L. Wang, X. Geng, J. Bezdek, C. Leckie, K. Ramamohanarao, Enhanced visual analysis for cluster tendency assessment and data partitioning, Transactions on Knowledge and Data Engineering (2009) 1401–1414.

[27] S. Prince, J. Elder, Bayesian identity clustering, Proc. of Canadian Conference on Computer and Robot Vision (CRV), IEEE, 2010, pp. 32–39.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, volume 2, John Wiley & Sons, 2001.

[29] F. Masulli, A new approach to hierarchical clustering for the analysis of genomic data, Proc. of Int. Joint Conf. on Neural Networks (IJCNN), volume 1, IEEE, 2005, pp. 155–160.

[30] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is "nearest neighbor" meaningful?, Database Theory—ICDT, 1999, pp. 217–235.

[31] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2002, pp. 603–619.

[32] L. Heyer, S. Kruglyak, S. Yooseph, Exploring expression data: identification and analysis of coexpressed genes, Genome Res. 9 (1999) 1106–1115.

[33] P. Viola, M. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2004) 137–154.

[34] G.B. Huang, V. Jain, E. Learned-Miller, Unsupervised joint alignment of complex images, Proc. of Int. Conf. on Computer Vision (ICCV), IEEE, 2007, pp. 1–8.

[35] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, Proc. of Int. Conf. on Analysis and Modeling of Faces and Gestures, Springer-Verlag, 2007, pp. 168–182.

[36] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci. 3 (1991) 71–86.

[37] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, Trans. on Pattern Analysis and Machine Intelligence (PAMI), 19, 2002, pp. 711–720.

[38] T. Ahonen, A. Hadid, M. Pietikäinen, Face recognition with local binary patterns, Proc. of European Conf. on Computer Vision (ECCV), Springer, 2004, pp. 469–481.

[39] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2002, pp. 971–987.

[40] I. Jolliffe, Principal Component Analysis, Wiley Online Library, 2002.

[41] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, Proc. of Advances in Neural Information Processing (NIPS), volume 13, MIT Press, 2001, pp. 556–562.

[42] S.Z. Li, X. Hou, H. Zhang, Q. Cheng, Learning spatially localized parts-based representations, Proc. Conf. Computer Vision and Pattern Recognition (CVPR), volume 1, IEEE, 2001, pp. 207–212.

[43] A. Bosch, A. Zisserman, X. Munoz, Representing shape with a spatial pyramid kernel, Proc. of Int. Conf. on Image and Video Retrieval (CIVR), ACM, 2007, pp. 401–408.

[44] T. Ahonen, A. Hadid, M. Pietikäinen, Face description with local binary patterns: application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 2037–2041.

[45] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Proc. of Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, IEEE, 2006, pp. 2169–2178.

[46] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive–aggressive algorithms, J. Mach. Learn. Res. 7 (2006) 551–585.

[47] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, Proc. of Conf. in Machine Learning, Morgan Kaufmann Pub., 1997, pp. 179–186.

[48] T. Hastie, R. Tibshirani, Classification by pairwise coupling, Ann. Stat. 26 (1998) 451–471.

[49] T. Minka, A comparison of numerical optimizers for logistic regression, unpublished draft, 2003.

[50] F.S. Hillier, G.J. Lieberman, Introduction to Operations Research, McGraw-Hill, 2001.

[51] G. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments, Technical Report 07, University of Massachusetts, Amherst, 2007.

[52] Y. Taigman, L. Wolf, T. Hassner, Multiple one-shots for utilizing class label information, in: British Machine Vision Conference (BMVC), volume 2, 2009, pp. 1–12.

[53] H. Nguyen, L. Bai, Cosine similarity metric learning for face verification, Proc. of Asian Conf. on Computer Vision—ACCV, Springer, 2011, pp. 709–720.

[54] S. Prince, P. Li, Y. Fu, U. Mohammed, J. Elder, Probabilistic models for inference about identity, Trans. Pattern Anal. Mach. Intell. (2011) 144–157.

[55] A. Jain, M. Murty, P. Flynn, Data clustering: a review, ACM Computing Surveys (CSUR), 31, 1999, pp. 264–323.