
Unsupervised Learning of Dense Hierarchical Appearance Representations

Fabien Scalzo and Justus H. Piater
Montefiore Institute, University of Liège, 4000 Liège, Belgium

{fscalzo,Justus.Piater}@ulg.ac.be

Abstract

We describe an unsupervised, probabilistic method for learning visual feature hierarchies. Starting from local, low-level features computed at random locations, the method combines features hierarchically. At each level of the hierarchy, pairs of features are identified that tend to occur at stable positions relative to each other, by clustering the configurational distributions of observed feature co-occurrences using Expectation-Maximization. Stable pairs of features thus identified are combined into higher-level features. This learning scheme results in a graphical model that constitutes a probabilistic representation of a flexible visual feature hierarchy. For detection, evidence is propagated using Nonparametric Belief Propagation, a recent generalization of particle filtering. In experiments, the proposed approach demonstrates effective learning and robust detection of objects in the presence of clutter and occlusion.

1. Introduction

Most natural and human-made objects have rigid local shape and appearance, but often present more variability at larger scales. Hierarchical approaches to object recognition have recently received increasing attention [1, 9, 2, 7]. They are well suited to represent shape variability at different scales and granularities within a single coherent framework. The current paper proposes an unsupervised generative method for learning hierarchical models that integrate spatial constraints between features [4]. We use a graphical model to represent the hierarchical feature structure (Section 2). In such a representation, each node corresponds to a visual feature class and is annotated with its possible locations in the image. The edges model both the spatial arrangement and the statistical dependence between two feature classes.

As was mentioned by G. Bouchard and B. Triggs [1], nearby object features have strongly correlated positions, while more distant features are much more weakly correlated. The proposed learning framework (Section 3) focuses on modelling correlations between features. It accumulates statistical evidence from feature observations in order to find feature classes that are often detected in the same neighborhood. Such features are said to be spatially correlated. The graph is incrementally built by combining correlated features into higher-level abstractions.

Detection (Section 4) is achieved by using Nonparametric Belief Propagation [8], a message-passing algorithm, to infer the belief associated with higher-level feature classes, which are not directly observable. Finally, the functioning and the efficacy of our method are demonstrated on both artificial and real image data sets (Section 5).

Figure 1. Object part decomposition (left) and corresponding graphical model (right).

2. A Hierarchical Representation of Features

In order to represent a hierarchy of visual features, a flexible statistical framework is required. A graphical model is perfectly suitable for this task (Figure 1). Nodes represent feature classes and are annotated with the detection information for a given scene. Edges represent two types of information: the distance between features and their hierarchical composition. The formulation in terms of graphical models is attractive because it provides a statistical model of the variability of shape and appearance. These are specified separately by the edges and the low-level nodes of the graph, respectively.

The vertices s ∈ V of the graph represent feature classes. They contain the feature activations for a given image. The graphical model associates each node with a hidden random variable x ∈ R² representing the spatial probability distribution of the feature in the image.

The edges e ∈ E of the graph model two aspects of the feature structure. First, an edge signifies that the destination feature is a part of the source feature. Second, it describes the spatial relation between the compound feature and the subfeature in terms of distance. The annotation does not contain any specific information about a given scene but represents a static view of the feature hierarchy and shape. Each edge e_{t→s} from feature t to s is associated with parameters (d_{st}, σ_{st}) that respectively define the distance between the features and the corresponding variance.
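As a concrete illustration, the hierarchy described above can be held in a small data structure. This is only a sketch under our own naming (FeatureNode, FeatureEdge, FeatureGraph are not terms used in the paper); it records exactly the information listed in this section: feature classes as vertices, per-image activations as node annotations, and (distance, variance) pairs on the directed edges.

    from dataclasses import dataclass, field
    from typing import List, Optional
    import numpy as np

    @dataclass
    class FeatureNode:
        # A feature class. Primitive (leaf) nodes carry an appearance descriptor;
        # compound nodes are defined only through their outgoing edges.
        node_id: int
        is_primitive: bool
        descriptor: Optional[np.ndarray] = None
        activations: Optional[np.ndarray] = None  # weighted sample locations for the current image

    @dataclass
    class FeatureEdge:
        # Edge e_{t->s}: feature s is a part of feature t, annotated with the
        # expected distance d_{ts} between the two classes and its variance.
        parent: int
        child: int
        d: float
        sigma: float

    @dataclass
    class FeatureGraph:
        V: List[FeatureNode] = field(default_factory=list)
        E: List[FeatureEdge] = field(default_factory=list)

        def children(self, node_id: int) -> List[FeatureEdge]:
            return [e for e in self.E if e.parent == node_id]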

In our representation, we distinguish between two different types of features: primitive and compound features. Primitive features correspond to low-level features. They are represented by a local descriptor made of orientation-normalized pixel values. Orientation normalization is obtained by rotating the region around the point in the gradient direction. Descriptors are extracted at random locations; any other feature detector could be used. Compound features consist of flexible geometric combinations of other subfeatures (primitive or compound). Compound feature classes are represented by non-leaf nodes, whereas primitive feature classes are leaf nodes at the first level of the graph.
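As an illustration of the primitive features just described, the following sketch extracts orientation-normalized patches at random locations. The 13 × 13 patch size is the one used in Section 5; the use of scipy for rotation and of the single-pixel gradient for the dominant orientation are simplifying assumptions.

    import numpy as np
    from scipy import ndimage

    def primitive_descriptor(image, y, x, size=13):
        # Rotate the region around (y, x) by its gradient direction so that the
        # orientation is normalized away, then return the raw pixel values.
        gy, gx = np.gradient(image.astype(float))
        theta = np.arctan2(gy[y, x], gx[y, x])             # local gradient direction
        half = size                                         # larger window so rotation does not clip
        window = image[y - half:y + half + 1, x - half:x + half + 1].astype(float)
        rotated = ndimage.rotate(window, np.degrees(theta), reshape=False, mode='nearest')
        cy, cx = rotated.shape[0] // 2, rotated.shape[1] // 2
        patch = rotated[cy - size // 2:cy + size // 2 + 1, cx - size // 2:cx + size // 2 + 1]
        return patch.ravel()

    def random_primitives(image, n=200, size=13, seed=0):
        # Sample n random interior locations and compute their descriptors.
        rng = np.random.default_rng(seed)
        h, w = image.shape
        ys = rng.integers(size, h - size, n)
        xs = rng.integers(size, w - size, n)
        descs = np.array([primitive_descriptor(image, y, x, size) for y, x in zip(ys, xs)])
        return np.column_stack([xs, ys]), descs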

3. Learning a Hierarchy of Feature Classes

The basic concept behind the learning algorithm is to accumulate statistics of the relative positions of observed features in order to find conspicuous coincidences of feature co-occurrences. The structure of our graphical model is built incrementally by combining spatially correlated features into new feature abstractions.

Before the first iteration, the learning method applies a K-means clustering algorithm to the set of feature descriptors. This identifies primitive classes from the training set. These classes are used to create the first level (leaves) of the graphical model. After clustering, the procedure votes to accumulate information on the distance (Λ_{i,j}) between features and extracts the feature pairs (S ← [f_i, f_j]) that tend to be located in the same neighborhood. Then it estimates the parameters (mean and variance of the distance) of the spatial relation using Expectation-Maximization (EM). It finally creates new feature nodes in the graph by combining the closest correlated features. This learning procedure is applied iteratively to each new level in the graph. An outline is given in Algorithm 1, and the main steps are described in the following paragraphs.

Algorithm 1 Learning: learn(G, level)
1: Λ, S ← extract(G, level)       // extract correlated features
2: for all feature pairs [f_i, f_j] ∈ S do
3:   T_ij ← EM(Λ_{i,j})           // estimate models of spatial relation
4:   T ← T ∪ closest(T_ij)        // keep the best relations
5: end for
6: G ← generate(G, T)             // generate new features
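The leaves of the graph, created by the clustering step mentioned just before Algorithm 1, can be obtained with a standard K-means implementation. A minimal sketch with scikit-learn is given below; the paper selects the number of classes with a BIC criterion (Section 5), whereas here k is simply a parameter.

    import numpy as np
    from sklearn.cluster import KMeans

    def primitive_classes(descriptors, k=50, seed=0):
        # Cluster patch descriptors into k primitive feature classes (the first
        # level of the graphical model); returns class centers and assignments.
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(descriptors)
        return km.cluster_centers_, km.labels_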

3.1. Finding Spatially Correlated Features

A spatial correlation exists between two feature classes if they are often detected in the same neighborhood. The key idea to finding correlated features is to apply a voting scheme that collects co-occurrence statistics (distances) from the training set. These statistics are collected from multiple feature occurrences within one or across many different images by considering each detected feature class and by voting for the class of the closest features detected in the same region. After applying this procedure over the training set, we obtain a set of distances Λ for every pair of classes. To extract spatially correlated classes, we keep the pairs that obtained the largest number of votes in a table S.
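A possible implementation of this voting step is sketched below. Detections are assumed to be given per image as (class_id, x, y) triples, and the neighborhood radius is a free parameter that the paper does not specify; for simplicity, all features within the radius receive a vote rather than only the closest ones.

    import numpy as np
    from collections import defaultdict

    def vote_cooccurrences(detections_per_image, radius=30.0):
        # Accumulate, for every pair of classes, the number of co-occurrence
        # votes and the observed distances Lambda between their detections.
        votes = defaultdict(int)      # (class_i, class_j) -> vote count
        Lambda = defaultdict(list)    # (class_i, class_j) -> observed distances
        for dets in detections_per_image:
            classes = [c for c, _, _ in dets]
            pts = np.array([(x, y) for _, x, y in dets], dtype=float)
            for a in range(len(pts)):
                dist = np.linalg.norm(pts - pts[a], axis=1)
                for b in np.nonzero((dist > 0) & (dist < radius))[0]:
                    key = tuple(sorted((classes[a], classes[int(b)])))
                    votes[key] += 1
                    Lambda[key].append(float(dist[b]))
        return votes, Lambda

    def correlated_pairs(votes, top=20):
        # The table S: the pairs with the largest number of votes.
        return sorted(votes, key=votes.get, reverse=True)[:top]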

3.2. Estimating Spatial Relations

In our framework, spatial relations are defined in terms of distances between features. In order to estimate this parameter from co-occurrence statistics, we use an Expectation-Maximization (EM) algorithm. In principle, a sample of observed distances Λ_{i,j} between two given feature classes [f_j, f_i] can be approximated by a Gaussian mixture, where each component represents a cluster of distances between these features. EM is used to estimate the parameters of the spatial relation between each correlated feature pair [f_i, f_j] ∈ S. It maximizes the likelihood of the observed distances over the model parameters Θ = (w_{1...K}; d_{1...K}; σ_{1...K}). In practice, a mixture of only two Gaussian components (K = 2) is sufficient to model spatial relations. The first component represents the most probable distance, and the second is used to model the noise. When the distance d_1 and variance σ_1 of the first component are estimated between features i and j, we store them at entry [i, j] in a table T.
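This EM step can be reproduced with a standard two-component one-dimensional Gaussian mixture, for instance with scikit-learn. Selecting the heavier component as the spatial relation and treating the other as noise follows the description above; the choice of scikit-learn is ours.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_spatial_relation(distances):
        # Fit a K = 2 Gaussian mixture to the observed distances Lambda_{i,j}
        # between a correlated pair of classes and return (d, sigma) of the
        # component that models the relation (the other absorbs the noise).
        X = np.asarray(distances, dtype=float).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, covariance_type='full',
                              random_state=0).fit(X)
        best = int(np.argmax(gmm.weights_))
        d = float(gmm.means_[best, 0])
        sigma = float(np.sqrt(gmm.covariances_[best, 0, 0]))
        return d, sigma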

3.3. Generating New Features

When a reliable spatial relation has been estimated between a pair of feature classes [f_i, f_j], the generation of a new feature is straightforward. We combine these features to create a new higher-level feature by adding a new node f_n in the graph. We connect it to its subfeatures [f_i, f_j] by two edges {e_{n→i}, e_{n→j}} that are added to the edge set E.

The annotations of the edges are computed using the mean and the variance of the distance [d_{ij}, σ_{ij}] ∈ T estimated during the preceding step. The relative position (i.e. distance) of the generated feature is chosen as the midpoint between the subfeatures. Thus the distance from the subfeatures to the new feature is set to one half of the estimated distance between the subfeatures, such that e_{n→i} = {d_{ij}/2, σ_{ij}} and e_{n→j} = {d_{ji}/2, σ_{ji}}. Finally, the new edges and the new feature are respectively added to the edge set, E ← E ∪ {e_{n→i}, e_{n→j}}, and to the vertex set, V ← V ∪ f_n, of the current graph G.
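Using the graph structures sketched in Section 2, this generation step can be written directly from the text; the halving of the estimated distance and the reuse of its variance follow the paragraph above.

    def generate_feature(graph, fi, fj, d_ij, sigma_ij):
        # Add a compound node f_n and connect it to its two subfeatures by
        # edges annotated with half the estimated distance between them.
        fn = FeatureNode(node_id=len(graph.V), is_primitive=False)
        graph.V.append(fn)
        graph.E.append(FeatureEdge(parent=fn.node_id, child=fi,
                                   d=d_ij / 2.0, sigma=sigma_ij))
        graph.E.append(FeatureEdge(parent=fn.node_id, child=fj,
                                   d=d_ij / 2.0, sigma=sigma_ij))
        return fn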


4. Inferring High-Level Features

Inferring the probability density function of hierarchical features conditioned on the detected primitives amounts to estimating the belief of the corresponding higher-level nodes. Feature detection can thus be posed as inference in a graphical model. One way to do this is to use Nonparametric Belief Propagation (NBP) [8]. NBP is an inference algorithm for graphical models that generalizes particle filtering. This algorithm propagates information by a series of local message-passing operations. Following the NBP notation, a message m_{i,j} from node i to node j is written

m_{i,j}(x_j) = ∫ ψ_{i,j}(x_i, x_j) φ_i(x_i, y_i) ∏_{k ∈ τ(i)\j} m_{k,i}(x_i) dx_i    (1)

where x_i is the random variable associated with node i, τ(i) denotes its neighbors, and φ_i(x_i, y_i) is the observation potential obtained via the detection of a low-level feature. The potential ψ_{i,j} models the spatial relation between nodes i and j. Since the prior p(x_i) is uniform, it is proportional to the conditional ψ_{i,j}(x_j|x_i). This is defined by a function F(x_i, d_{i,j}, σ_{i,j}) that maps the distribution of a node i to a connected node j according to their spatial relation (Fig. 2).

Each random variable x is represented by a density estimate that is defined as a set of weighted samples {µ^m, w^m}_{m=1}^M. The parameters are the means µ^m and the weights w^m of the particles. During an iteration of NBP, each vertex receives messages from connected nodes. Each message consists of a set of samples. By estimating the product of these incoming messages m_{k,i}(x_i), a node can compute the new local belief, an approximation of the marginal distribution p(x_i|y):

p(x_i|y) = φ_i(x_i, y_i) ∏_{k ∈ τ(i)} m_{k,i}(x_i)    (2)

Since an exact computation of this product is intractable, we use the Gibbs sampler to draw M weighted samples {µ_i^m, w_i^m}_{m=1}^M from the product of incoming messages. The number M of samples used to represent each message is proportional to the image size.
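For illustration, the product in Eq. 2 can also be approximated without a Gibbs sampler by treating each incoming message as a kernel density estimate with the fixed per-sample variance mentioned later in this section, and evaluating all of them at a common set of candidate locations. This is a simplification of the authors' procedure, not the method of the paper.

    import numpy as np

    def kde_eval(points, samples, weights, bandwidth):
        # Evaluate a weighted, isotropic Gaussian kernel density estimate
        # (the sample-based message representation) at the given 2-D points.
        d2 = ((points[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / bandwidth ** 2) @ weights

    def approximate_belief(candidates, phi, messages, bandwidth=5.0):
        # Eq. 2: belief proportional to the observation potential times the
        # product of all incoming messages, evaluated at candidate locations.
        # candidates: (N, 2); phi: (N,) values of phi_i; messages: list of
        # (samples, weights) pairs received from the neighbours.
        belief = np.asarray(phi, dtype=float).copy()
        for samples, weights in messages:
            belief *= kde_eval(candidates, np.asarray(samples, float),
                               np.asarray(weights, float), bandwidth)
        total = belief.sum()
        return belief / total if total > 0 else belief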

A message sent from i to j specifies the location evidence of the destination node j according to the belief available in the source node i. Such a message is produced by computing the product of the messages received by i (except the one received from j) with the observation potential,

{µ_p^m, w_p^m}_{m=1}^M = φ_i(x_i, y_i) ∏_{k ∈ τ(i)\j} m_{k,i}(x_i)    (3)

Then we resample from the conditional ψ_{i,j}(x_j|x_i). To do this, each sample µ_p^m is moved by the distance d_{ij} in a set of random directions θ_k ∈ [0, 2π[ (Eq. 4). The resulting particles constitute the new message m_{i,j} = {µ_{ij}^m, w_p^m}_{m=1}^M.

µ_{ij}^m = µ_p^m + d_{ij} [sin θ_k  cos θ_k]    (4)
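Equations 3 and 4 translate almost directly into code: the samples of the partial product are displaced by the learned distance d_{ij} in random directions to form the outgoing message. The number of directions is an assumption, and the fixed per-sample variance is left implicit.

    import numpy as np

    def send_message(samples_p, weights_p, d_ij, n_directions=8, seed=0):
        # Eq. 4: move each sample of the partial product (Eq. 3) by d_ij in a
        # set of random directions; the displaced particles, carrying their
        # original weights, form the outgoing message m_{i,j}.
        rng = np.random.default_rng(seed)
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_directions)
        offsets = d_ij * np.column_stack([np.sin(theta), np.cos(theta)])    # (K, 2)
        mu = (samples_p[:, None, :] + offsets[None, :, :]).reshape(-1, 2)   # (M*K, 2)
        w = np.repeat(weights_p, n_directions)
        return mu, w / w.sum()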

Figure 2. Function F(x_i, d_{ij}, σ_{ij}) computes the conditional probability P(x_j|x_i) between two variables.

We approximate NBP by setting the variance of each sample to a fixed value.

In our representation, higher-level features are not directly observable. Evidence is incorporated into the graph exclusively via the primitive classes at the leaves of the graph. The observation potentials φ_i(x_i, y_i) are defined by Gaussian mixtures. For initialization, we match detected features in an image with the most similar classes. Each component is set to a detected location and thus represents a possible location of the feature class. The weight of a component is inversely proportional to the Mahalanobis distance between the observed descriptor and the feature class. To deal with occlusions, we add a high-variance Gaussian located at the image center to the observation potential.
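The observation potential described above can be assembled as follows. The detections are assumed to be given as ((x, y), descriptor) pairs already matched to the class; a plain Euclidean descriptor distance stands in for the Mahalanobis distance of the paper, and the component variances are arbitrary choices.

    import numpy as np

    def observation_potential(detections, class_descriptor, image_shape,
                              comp_sigma=5.0, occlusion_weight=0.1):
        # Build the Gaussian-mixture potential phi_i for one primitive class:
        # one component per detection, weighted inversely to the descriptor
        # distance, plus a broad component at the image centre for occlusion.
        means, weights, sigmas = [], [], []
        for (x, y), descriptor in detections:
            dist = np.linalg.norm(np.asarray(descriptor) - class_descriptor) + 1e-6
            means.append((x, y))
            weights.append(1.0 / dist)
            sigmas.append(comp_sigma)
        h, w = image_shape
        means.append((w / 2.0, h / 2.0))          # high-variance occlusion component
        weights.append(occlusion_weight * sum(weights) if weights else occlusion_weight)
        sigmas.append(max(h, w) / 2.0)
        weights = np.asarray(weights, dtype=float)
        return np.asarray(means, dtype=float), weights / weights.sum(), np.asarray(sigmas)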

5. Experiments

We have investigated the behavior of hierarchies on two object recognition tasks. During the first experiment, we trained our models on the objects of the COIL-100 database [5]. As a training set, we utilized 24 of the 72 available views uniformly distributed around each object. Our primitives were random patches described by rotation-invariant descriptors of 13 × 13 pixel values. We deliberately used simple, relatively inexpressive features to demonstrate the functioning of our method. The number of feature classes was selected according to the BIC criterion. We limited the combination of each feature to the three best spatial relations, and we ended the procedure after six levels of combination. A graph resulting from learning one object model is presented in Figure 3. Recognition was tested against the 48 remaining images of each object. Figure 4 shows the confusion matrices for the first 25 object models for an increasing number of levels (1 to 6) in the hierarchy. Brightness indicates the total number of detections. As we can observe, a single level of feature combinations is weakly discriminant. Thanks to our hierarchy, selectivity arises from the spatial combination of high-level features.

The second object recognition experiment was conducted on a database of real objects [6]. This image library contains color images of eight different real objects. Each image was captured at a different, arbitrary pose around the object. For our experiments, we again used the same random primitives but kept the best five spatial relations of each feature to construct the hierarchy. The models were evaluated on 51 test images of the object recognition database. These images represent cluttered scenes containing previously-learned objects along with distractors and various illumination conditions. The recognition results (Figure 5) are comparable to state-of-the-art methods using 2D models [3] or explicit 3D matching [6]. This is remarkable since we use neither robust features nor any discriminative information to build our models. Ferrari's method considers all training images independently, which essentially reduces recognition to the matching of robust features. By using statistical learning and a flexible representation, we obtain good results with simple features.

6. Conclusion

We presented a generic framework for unsupervised learning of visual feature hierarchies. Our experiments show that the system works well for both isolated objects [5] and real scenes [6] in the presence of clutter and occlusion. They demonstrated the representational power of the hierarchy; recognition rates increase with the depth of the hierarchy. In comparison with previous approaches, the proposed framework offers several improvements. First, it allows the learning and the representation of shape flexibility at different levels in the hierarchy by only considering the distance between features. Second, it does not rely on complex features such as SIFT features; simple random primitives can be used. Third, where previous methods were limited to a few high-level features (typically fewer than 10), our system is able to deal with dense graphical models containing more than 500 nodes. Finally, the recognition of 3D real objects is successful even in the presence of arbitrary 3D rotations, scale changes, varying lighting conditions and severe occlusions.

Figure 3. Graphical model learned from random patches for an object of COIL-100 [5].

Figure 4. Confusion matrices for one- to six-level models (COIL-100 [5]).

Figure 5. ROC curve obtained by our hierarchical framework and two of the best state-of-the-art methods.

Figure 6. Object detection: the illustrations correspond to the activation of the top-level feature that had the highest response rate.

References

[1] G. Bouchard and B. Triggs. Hierarchical part-based visual object categorization. In CVPR, pages 710–715, 2005.

[2] B. Epshtein and S. Ullman. Feature hierarchies for object classification. In ICCV, pages 220–227, 2005.

[3] V. Ferrari, T. Tuytelaars, and L. Van Gool. Simultaneous object recognition and segmentation by image exploration. In ECCV, pages 40–54, May 2004.

[4] S. Lazebnik, C. Schmid, and J. Ponce. Semi-local affine parts for object recognition. In BMVC, pages 959–968, 2004.

[5] S. Nene, S. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-005-96, 1996.

[6] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. IJCV, 66:231–259, March 2006.

[7] F. Scalzo and J. Piater. Statistical learning of visual feature hierarchies. In IEEE Workshop on Learning in CVPR, 2005.

[8] E. Sudderth, A. Ihler, W. Freeman, and A. Willsky. Nonparametric belief propagation. In CVPR, pages 605–612, 2003.

[9] H. Wersing and E. Koerner. Unsupervised learning of combination features for hierarchical recognition models. In ICANN, pages 1225–1230, 2002.

This work was supported by the Wallonian Ministry of Research (D.G.T.R.E.) under Contract No. 03/1/5438.