
Recognizing Objects in Cluttered Images using Subgraph Isomorphism ∗

Tushar Saxena, Peter Tu and Richard Hartley
CMA Consulting, Schenectady, NY and
GE - Corporate Research and Development, P.O. Box 8, Schenectady, NY, 12301.

Abstract

This paper reports on a new approach to object recognition based on exploiting both geometric and spectral information in an image. An algorithm is described for finding objects in an image based on inexact graph matching against a template that incorporates information about the geometry and color of segmented regions of the image. The image is encoded as an attributed graph in which vertices represent regions, annotated with the position, shape and color of the corresponding region. Finding the template in a new image takes place in three steps: local, neighborhood and global matching. In the last step a maximal set of mutually compatible template/candidate image region assignments is sought. Currently the system has been evaluated using 3-band color images, though experiments carried out with 4-band images suggest improved performance with such images.

1 Outline

The present paper describes a method for finding objects in images. The typical situation is that one has an image of the object sought. The task is to find the object in a new image, taken from a somewhat different viewpoint, possibly under different lighting. The method used is based on approximate attributed graph matching. As a first step, the image is segmented into regions of approximately constant color. The geometrical relationship of the segmented colored regions is represented by an attributed graph, in which each segment corresponds to a vertex in the graph, and proximate regions are joined by an edge. Vertices are annotated with the size, shape and color of the corresponding segment. Finding an object in a new image

∗This work was supported by DARPA contract F33615-94-C-1021, monitored by Wright Patterson Air Force Base, Dayton, OH. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the United States Government, or General Electric.

Figure 1: The result of edge detection in a template image containing a cup.

then comes down to an approximate graph-matching problem in which a match is sought in the new image for a subgraph approximating the one corresponding to the template. The graph matching can only be approximate, because of the inexactness of the segmentation process, and the changed aspect of the object, due to change of lighting, viewpoint, and possible partial occlusion.

2 Extracting Object Faces from Images

The first step in the algorithm is the division of the image into faces (or regions) of approximately constant color. The face extraction process proceeds in three basic steps, outlined in the following subsections.

2.1 Detecting Approximate Region Boundaries

First, the boundaries (that is, edges) hypothesized to enclose the regions are detected. As a first step in this process, the edges in the image are detected using a Canny-style edge detector, and line segments (larger than a threshold) are fitted to the resulting edgels. It is reasonable to assume that the region boundaries pass through the resulting line segments, since under


Figure 2: Constrained triangulation on the adjusted cup lines.

our assumptions, a face boundary will produce a discontinuity in the intensity and color variation of the enclosed region, and thus show up during edge detection. Typically, however, these boundaries are detected in the form of numerous small broken line segments (see figure 1). It is difficult to identify the exact geometry of the enclosed faces directly from these line segments. To improve the boundary geometries, we further process the line segments with some heuristics, among them: merge adjacent near-collinear lines; complete T-junctions and intersections by filling small gaps when appropriate. In our experience, these heuristics correct most of the degenerate boundary segments.
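The first of these heuristics, merging adjacent near-collinear lines, can be sketched roughly as follows. The thresholds and the function name are illustrative assumptions, not values from the paper:

```python
import math

def merge_if_near_collinear(seg_a, seg_b, angle_tol_deg=5.0, gap_tol=4.0):
    """Merge two line segments if they are nearly collinear and their
    nearest endpoints are within gap_tol pixels; return the merged
    segment or None. Thresholds are illustrative, not from the paper."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    ang_a = math.atan2(ay2 - ay1, ax2 - ax1)
    ang_b = math.atan2(by2 - by1, bx2 - bx1)
    # Orientation difference, folded into [0, 90] degrees.
    d = abs(math.degrees(ang_a - ang_b)) % 180.0
    d = min(d, 180.0 - d)
    if d > angle_tol_deg:
        return None
    # Smallest endpoint-to-endpoint gap between the two segments.
    gap = min(math.dist(p, q) for p in seg_a for q in seg_b)
    if gap > gap_tol:
        return None
    # Replace both with the longest span over all four endpoints.
    pts = list(seg_a) + list(seg_b)
    p, q = max(((p, q) for p in pts for q in pts),
               key=lambda pq: math.dist(*pq))
    return (p, q)
```

Repeatedly applying such a rule over all segment pairs until no merge fires would implement the heuristic.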

2.2 Estimating Initial Uniform Regions: Constrained Triangulation

Using the boundary line segments from the previous step, we now generate an initial partition of the image into triangles of uniform intensity and color. This is accomplished by a constrained triangulation of the boundary lines. A constrained triangulation produces a set of triangles which join nearest points (end-points of the lines), but respect the constraining boundary lines. That is, each boundary line segment will be an edge of some triangle. Since all triangles are formed from the end-points and lines on the boundaries of the faces, each triangle lies completely inside a face. Moreover, since these triangles cover the whole image, each face can be represented by a union of a finite number of these triangles. As an example, see the result of constrained triangulation on an image segmentation in figure 2.

Next, we describe the region-merging process by which the object faces are derived from this initial triangulation.

Figure 3: Faces of the cup extracted by our algorithm.

2.3 Region Merging

In this step, a region-merging procedure is used to incrementally generate the visible object faces in the image. Starting with the triangular regions from the constrained triangulation, neighboring regions are successively merged if they have at least one of the following properties:

1. Similar color intensities: Two adjacent regions are merged if the difference between their average color intensity vectors is less than a threshold. This is a reasonable merging property since neighboring faces of an object lie at angles to each other, and are therefore likely to appear with different intensities in the image. As a refinement of this method, one could merge two regions based on a decision of which of two hypotheses (the two regions are separate; the two regions should form a single region) is preferable, based on the color statistics of the regions. In addition, a linear or more complex color gradient over a face could be modelled. These methods have been suggested in [7] but we have not tried them yet.

2. Unsupported bridge: Two adjacent regions are merged if the percentage of edges common between them which are unsupported is larger than a threshold. An edge is said to be supported if a specified percentage of its pixels belong to an edgel detected by the edge detector. Merging based on this property will ensure the inclusion of those boundary segments which were missing from the set of line segments derived from edgels.

After each merge, the properties (size and color) of the new, larger region are recomputed from the properties of the two regions being merged.
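The per-merge property update can be sketched as an area-weighted average; the field names are assumed for illustration:

```python
def merge_region_props(r1, r2):
    """Recompute size and average color of the union of two regions
    from their stored properties (area-weighted mean of the color
    vectors); a sketch of the per-merge update described above."""
    area = r1["area"] + r2["area"]
    color = tuple(
        (r1["area"] * c1 + r2["area"] * c2) / area
        for c1, c2 in zip(r1["color"], r2["color"])
    )
    return {"area": area, "color": color}
```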

The merging iterations continue until the color intensities of each pair of neighboring regions are sufficiently different, and most of the edges common between them have support from the segmentation. As an illustration, figure 3 shows the faces corresponding to the cup extracted by our complete merging process. The merging process is followed by a cleaning-up phase to remove small and narrow residual regions that slight variations of color have prevented from being merged with adjacent larger regions. The result of the segmentation and merging algorithm is a set of regions with associated color (RGB) values.

3 Deriving Graph Representations of Objects

Once all the object faces in the image have been generated, they are represented as a graph that captures the relative placements of the faces in the image.

Each vertex in the graph represents a region, and is annotated with its shape, position and color attributes. Shape is represented by the moment matrix of the region, from which one may derive the area, along with the orientation and ratio of the principal axes of the region. In effect, the region is being represented as an ellipse. This is of course an extremely rough representation of the shape of the region. However, it is also quite forgiving of variations of shape along the boundaries, or even a certain degree of fragmentation of the region. Since matching will not be done simply on the basis of a region-to-region match, but rather on matching of region clusters, this level of shape representation has proven to be adequate. More precise shape estimates have been considered; their use must be dictated by the degree of accuracy and repeatability of the segmentation process. The color of the region is represented by a spectral (in 3-band images, RGB) vector. Other color representations are of course possible, and have been tried by other authors ([7, 8]).
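The moment-matrix shape descriptor can be sketched as follows: the eigenvalues of the 2x2 central moment matrix give the principal axes of the equivalent ellipse. This is a generic sketch of the standard construction, not the authors' code:

```python
import math

def ellipse_from_region(pixels):
    """Fit the equivalent ellipse of a pixel region from its second-order
    central moments: returns (area, orientation_radians, axis_ratio)."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Central second moments (the 2x2 moment matrix [[mxx, mxy], [mxy, myy]]).
    mxx = sum((x - cx) ** 2 for x, _ in pixels) / n
    myy = sum((y - cy) ** 2 for _, y in pixels) / n
    mxy = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Eigenvalues of the symmetric 2x2 matrix in closed form.
    t, d = mxx + myy, mxx * myy - mxy * mxy
    root = math.sqrt(max(t * t / 4 - d, 0.0))
    lam1, lam2 = t / 2 + root, t / 2 - root
    theta = 0.5 * math.atan2(2 * mxy, mxx - myy)  # principal-axis orientation
    ratio = math.sqrt(lam1 / lam2) if lam2 > 0 else float("inf")
    return n, theta, ratio
```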

Because of the possibility of regions being fragmented or improperly merged, it turns out to be inappropriate to use edges in the graph to represent physically adjacent regions: the adjacency graph generated by such a rule is too sensitive to minor variations in the image segmentation. Instead, each vertex is joined to the vertices representing the N closest regions in the segmented image. A value of N = 8 was chosen, so each vertex in the graph has 8 neighbors.
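Construction of the N-nearest-neighbor region graph might look like this; the use of region centroids as the distance measure is an assumption, since the paper does not specify how inter-region distance is computed:

```python
import math

def knn_region_graph(centroids, n_neighbors=8):
    """Build the region graph: each vertex is joined to the N closest
    regions (N = 8 in the paper), here by centroid distance. Returns
    an adjacency list {region index: [neighbor indices]}."""
    graph = {}
    for i, ci in enumerate(centroids):
        others = [j for j in range(len(centroids)) if j != i]
        others.sort(key=lambda j: math.dist(ci, centroids[j]))
        graph[i] = others[:n_neighbors]
    return graph
```

Note that this relation is not symmetric: vertex a may list b among its N closest regions without the converse holding.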

4 The three-tier matching method

The reduction of the image to an attributed graph represents a significant simplification. The graph corresponding to a typical complicated image (the search image) may contain up to 500 or so vertices, whereas the graph corresponding to an object to be found (the template) may contain 50 vertices or so. Thus a complete one-on-one comparison may be carried out in quite a short time.

The search is carried out in three phases, as follows:

1. Local comparison. A one-to-one comparison of each pair of vertices is carried out. Each pair of vertices, one from the template graph and one from the search graph, is assigned a score based on similarity of shape, size and color, within rather liberal bounds.

2. Neighborhood comparison. The local neighborhood consisting of a vertex and its neighbors in the template graph is compared with a local neighborhood in the search graph. A score is assigned to each such neighborhood pairing based on compatibility, and the individual vertex-pair scores.

3. Global matching. A complete graph-matching algorithm is carried out, in which promising matches identified in the stage-2 matching are pieced together to identify a partial (or optimally a complete) graph match.

Each of these steps will be described in more detail in later sections. The idea behind this multi-stage matching approach is to avoid ruling out possible matches at an early stage, making the matching process robust to differences in the segmentation and viewpoint.

4.1 Local matching

In local matching, individual vertex pairs are evaluated. Each pair is assigned a score based on shape and color. Recall that each region is idealized as an ellipse. Shapes are compared on the basis of their size and eccentricity. Up to a factor of 2 difference in size is allowed without significant penalty. This allows for different scales in the two images, within reasonable bounds.
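A local shape score with the factor-of-2 size tolerance described above might look like this; only the tolerance is from the paper, the scoring formula itself is an assumption:

```python
import math

def local_shape_score(area1, ecc1, area2, ecc2):
    """Score a template/search region pair on size and eccentricity.
    Size ratios up to 2 incur no penalty; larger ratios and eccentricity
    differences reduce the score. Illustrative formula, not the paper's."""
    ratio = max(area1, area2) / min(area1, area2)
    # Free up to a factor of 2, then a smoothly growing penalty.
    size_penalty = max(0.0, math.log2(ratio) - 1.0)
    ecc_penalty = abs(ecc1 - ecc2)
    return math.exp(-(size_penalty + ecc_penalty))
```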

Because of different lighting conditions, colors may differ between two images. The most significant change in color, however, is due to a brightness difference. To allow for this, colors are normalized before being compared. The color of a region is represented by a vector, and vectors that differ by a constant multiple are held to represent the same color. The cost of a local match between two vertices is denoted by Clocal.

4.2 Neighborhood matching

Each vertex (here called the core) in the graph has eight neighbors representing the eight closest regions. In comparing the local neighborhood of one core vertex


v0 with the local neighborhood of a potential match v′0, an attempt is made to pair the neighbor nodes of v0 with those of v′0. In this matching the order of the neighbor vertices must be preserved. Thus, let v1, v2, ..., vn be the neighbors of one core vertex, given in cyclic angular order around the core, and let v′1, ..., v′m be the neighbors of a potential match core, similarly ordered. One seeks subsets S of the indices {1, ..., n} and S′ of the indices {1, ..., m} and a one-to-one mapping σ : S → S′ so that the matching vi ↔ v′σ(i) preserves cyclic order. The total cost of a neighborhood match is equal to

Cnbhd = w0 · Clocal(v0, v′0) + Σ_{i ∈ S} wi · Clocal(vi, v′σ(i))

where wi is a weight between 0 and 1 that depends on the ratio of distances between the core vertices and the neighbors vi and v′σ(i). For each pair of core vertices v0, v′0, the neighborhood matching that maximizes this cost function is found efficiently by dynamic programming.

4.3 Graph matching

In previous sections, the template image and the search image were reduced to graphs, and candidate matches between vertices in the two images were found. The goal of this section is to generate a mutually consistent set of vertex matches between the template and the search image. An association graph G [2, 4] provides a convenient framework for this process. The association graph should not be confused with the region adjacency graph considered so far. In the association graph, vertices represent pairs of regions, one from each image. Such a vertex represents a hypothesized matching of a region from the template image with a region from the search image. Weighted edges in the association graph represent compatibilities between the region matchings denoted by the two vertices connected by the edge.

Thus, a vertex in the association graph is given a double index, and denoted vij, meaning that it represents a match between region Ri in the template image and region R′j in the search image. This match may be denoted by Ri ↔ R′j. As an example, if j1 ≠ j2 then vij1 is not compatible with vij2. This is because vertex vij1 represents a match Ri ↔ R′j1 and vij2 represents the match Ri ↔ R′j2, and it is impossible that region Ri should match both R′j1 and R′j2. Thus, vertices vij1 and vij2 are incompatible and there is no edge joining these two vertices in the association graph. There are other cases in which matches are incompatible. For instance, consider a vertex vij representing a match Ri ↔ R′j and a vertex vkl representing a match Rk ↔ R′l. If regions Ri and Rk are close together in the template image, whereas R′j and R′l are far apart in the search image, then the matches Ri ↔ R′j and Rk ↔ R′l are incompatible, and so there is no edge joining the vertices vkl and vij. Matches may also be incompatible on the grounds of orientation or color.

Formally, the association graph G = {V, E} is composed of a set of vertices V and a set of weighted edges E ⊆ V × V. Each vertex v represents a possible match between a template region and a search region. If there are N template regions and M search regions then V would have NM vertices (see figure 4). In order to reduce the complexity of the problem, the graph G is pruned so that only the top 5 assignments for each template region are included in V. These nodes are labeled vij, interpreted as the jth possible assignment for the ith template region. A slack node for each template region is inserted into the graph; the slack node vi0 represents the possibility of the NULL assignment for the ith template region. If an edge e = (vij, vkl) exists then the assignments denoted by nodes vij and vkl are considered compatible. The weights for the edges are derived from the compatibility matrix C, which is defined as:

C(ij)(kl) =
    0         if j = 0 or l = 0
    0         if i = k and j ≠ l
    0 to 1    if (i, j) = (k, l)
    0 to 1    if vij and vkl are compatible
    −N        if vij and vkl are not compatible

where N is the number of template regions. The value of C(ij)(ij) represents the score given to the individual assignment defined by node vij. A subgraph of G represents a solution to the matching problem. In section 4.5, criteria for selecting the solution set will be described.
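The piecewise definition of C translates directly into code. Here `score` and `compatible` stand in for the geometric and color tests of section 4.4; both names, and the function itself, are illustrative assumptions:

```python
def compatibility(i, j, k, l, score, compatible, n_template):
    """Entry C[(i,j)][(k,l)] of the compatibility matrix sketched above.
    `score(i, j, k, l)` returns a value in [0, 1] and `compatible(...)`
    a boolean; an index of 0 denotes the NULL (slack) assignment."""
    if j == 0 or l == 0:
        return 0.0                    # slack assignments carry no weight
    if i == k and j != l:
        return 0.0                    # same template region, two targets
    if (i, j) == (k, l):
        return score(i, j, k, l)      # diagonal: individual match score
    if compatible(i, j, k, l):
        return score(i, j, k, l)
    return -float(n_template)         # incompatible pair: strong penalty
```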

4.4 Compatibility scores

The method of assigning compatibility scores is as follows. Consider a candidate region pair Ri ↔ R′j. The local neighborhood of region Ri has been matched with the neighborhood of R′j during the neighborhood matching stage. In doing this, a set of neighbors of the region Ri have been matched with the neighbors of the region R′j. This matching may be considered as a correspondence of several regions (a subset of the neighbors of Ri) with an equal number of regions in the other graph. From these correspondences a projective transformation is computed that maps the centroid of Ri to the centroid of R′j while at the same time as nearly as possible mapping the neighboring regions of Ri to their paired neighbors of R′j. Thus, the neighborhood correspondence is modelled as closely as



Figure 4: The template and search images are reduced to a set of regions. Each possible assignment pair is assigned to a node in the association graph. Edges in the graph connect compatible assignments. Note that although match b1 is compatible with c4 and c3, c3 and c4 are not compatible with each other. The best solution is the clique {a1, b3, c4}, representing a match of regions a, b and c from the first image with regions 1, 3 and 4 from the second.

possible by a projective transformation of the image. Let H be the projective transformation so computed.

Now let Rk ↔ R′l be another candidate region match. To see how compatible this is with the match Ri ↔ R′j, the projective transformation H is applied to the region Rk to see how well H(Rk) corresponds with R′l. As a measure of this correspondence, the vector from R′j to R′l is compared with the vector from R′j to H(Rk). A compatibility score is assigned based on the angle and length difference between these two vectors. The two assignments are deemed incompatible if the angle between the two vectors exceeds 45 degrees, or their length ratio exceeds 2.
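The geometric incompatibility test (angle above 45 degrees or length ratio above 2) can be sketched as:

```python
import math

def vectors_compatible(v1, v2, max_angle_deg=45.0, max_len_ratio=2.0):
    """Test whether the observed vector R'_j -> R'_l agrees with the
    predicted vector R'_j -> H(R_k): compatible if the angle between
    them is at most 45 degrees and their length ratio is at most 2."""
    len1 = math.hypot(*v1)
    len2 = math.hypot(*v2)
    if len1 == 0 or len2 == 0:
        return False
    if max(len1, len2) / min(len1, len2) > max_len_ratio:
        return False
    cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (len1 * len2)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    return angle <= max_angle_deg
```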

A color compatibility score is also defined. The correspondence of a core vertex and its neighbors with the matched configuration in the other image can be used to define an affine transformation of color space from the one image to the other. An affine color transformation is a suitable model for color variability under different lighting conditions ([9]). The affine transformation defined for one matched node pair is used to determine whether another matched node pair is compatible.

4.5 Clique finding

A popular graphical approach which can take advantage of some of the information contained in the edge structure of the association graph is a node clustering technique, in which a simple depth-first search is used to determine the largest connected subgraph of G, that is, the largest set of compatible region matches. (A graph is connected if a path of edges exists between every pair of its nodes.) This solution represents a certain amount of consistency. However, the statement that node a is consistent with node b and node b is consistent with node c does not necessarily imply that node a is consistent with node c. It follows that in order to take full advantage of the mutual constraints embedded in the association graph, the final solution should represent a clique on G.

A subset R ⊆ V is a clique on G if vij, vkl ∈ R implies that (vij, vkl) ∈ E. The search for a maximum clique is known to be an NP-complete problem [6]. Even after pruning, the computational costs associated with exhaustive techniques such as [1] would be prohibitive. It has been reported [3] that determining a maximum clique is analogous to finding the global maximum of a binary quadratic function. Authors such as [10, 11] have taken advantage of this idea by using relaxation and neural network methods to approximate the global maximum of a quadratic function, where this maximum corresponds to the largest clique in the association graph. Although the largest clique, which is based on the information contained in E, ensures a high level of mutual consistency, the nuances of the compatibility measures in C are lost. In order to take advantage of the continuous nature of these edge strengths, a quadratic formula is specified whose global maximum corresponds to the clique that has the maximum sum of internal edge strengths. An approach based on Gold and Rangarajan's graduated assignment algorithm (GAA, [5]) is used to estimate the optimal solution. The GAA is an iterative optimization algorithm which treats the problem as a continuous process but converges to a discrete solution. Even though the solution might correspond only to a local maximum, it is guaranteed to be a maximal clique, because the chosen compatibility measure strongly discourages incompatible matches. Although finding a maximum clique is an NP-complete problem, the quadratic function approach nevertheless provides a way of rapidly finding an approximate solution. The problem is formulated in terms of minimizing a continuous quadratic form via a descent algorithm, and then gradually imposing constraints to converge towards a binary solution. Details of the clique-finding algorithm are given in [13].
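For illustration only, a maximal clique with large total edge strength can be grown greedily. This is a crude stand-in for the graduated assignment optimization the paper actually uses, included just to make the objective (maximum sum of internal edge strengths over a clique) concrete:

```python
def greedy_weighted_clique(nodes, weight):
    """Grow a maximal clique greedily by total edge strength, restarting
    from every seed node. `weight(u, v)` returns the symmetric edge
    strength, or None when u and v are incompatible (no edge).
    A simplified stand-in, not the graduated assignment algorithm."""
    best, best_score = [], float("-inf")
    for seed in nodes:
        clique = [seed]
        # Candidates still compatible with everything in the clique.
        cand = [v for v in nodes if v != seed and weight(seed, v) is not None]
        while cand:
            # Add the candidate contributing the most total edge strength.
            v = max(cand, key=lambda c: sum(weight(u, c) for u in clique))
            clique.append(v)
            cand = [u for u in cand if u != v and weight(v, u) is not None]
        score = sum(weight(a, b)
                    for idx, a in enumerate(clique) for b in clique[idx + 1:])
        if score > best_score:
            best, best_score = clique, score
    return best
```

On the example of figure 4, with unit-strength edges among {a1, b3, c4} and a weaker edge a1-b1, the greedy search recovers the clique {a1, b3, c4}.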

5 Results

The system was evaluated on several sets of images. Figures 5 and 6 show the results of searching in different types of 3-band images.

6 More advanced color models

The color constancy model used in the existing algorithm is based on the observation ([9]) that colors


Figure 5: Recognizing a building. On the left the template, and on the right the search image showing the recognized building.

Figure 6: Recognition of cup image. On the left is the template, in the center the search image, and on the right the identified regions of the located cup. Note that the cup in the search image is seen from a different angle than in the template image.


are modified by an affine transformation under different lighting conditions. Thus, the color of a core vertex and its neighbors may be modified by an unknown (but constant over the image) affine transformation. This observation is used in defining the assignment compatibility function. At the local matching level, however, single region pairs are compared, using simple color scaling in the color comparison. Our intention is to use a more sophisticated color comparison method based on the methods of Healey and Slater ([12]), in which multi-band spectral variation is modelled as a linear subspace of the high-dimensional color space. We intend to apply this method to 4-band imagery. Use of this technique should give better local region matching. In addition, exploiting the extra band should give better results than we have been able to achieve with three-band color imagery.

To support this expectation, we have implemented a color-segmentation algorithm based on modelling colors as one- and two-dimensional subspaces in the 4-dimensional color space. Mahalanobis distance is used to determine whether a pixel belongs to a given color cluster or not. The results of these experiments are shown in figures 7 and 8.
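The subspace membership test can be sketched as follows. For simplicity this sketch uses plain Euclidean distance from the pixel to a 1-dimensional color subspace; the Mahalanobis distance actually used additionally whitens the residual by a sample covariance estimated from the training region:

```python
import math

def road_pixel(pixel, direction, threshold):
    """Classify a 4-band pixel as road if it lies close to the 1-D color
    subspace (a line through the origin) spanned by `direction`.
    Simplified: Euclidean rather than Mahalanobis residual distance."""
    norm = math.sqrt(sum(d * d for d in direction))
    u = [d / norm for d in direction]
    # Project the pixel onto the subspace, then measure the residual.
    coeff = sum(p * ui for p, ui in zip(pixel, u))
    residual = [p - coeff * ui for p, ui in zip(pixel, u)]
    dist = math.sqrt(sum(r * r for r in residual))
    return dist <= threshold
```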

7 Conclusion

The amalgamation of region segmentation algorithms with modern color constancy methods gives the possibility of improved object recognition in color and multi-spectral imagery. The adoption of an inexact graph-matching approach makes recognition independent of moderate lighting and viewpoint changes.

References

[1] Ambler A.P., Barrow H.G., Brown C.M., Burstall R.M., Popplestone R.J., 'A versatile computer-controlled assembly system', IJCAI, pages 298-307, 1973.

[2] Ballard D.H., Brown C.M., 'Computer Vision', Prentice-Hall, Englewood Cliffs, NJ, 1982.

[3] Barahona F., Jünger M., Reinelt G., 'Experiments in quadratic 0-1 programming', Mathematical Programming, vol. 44, pp. 127-137, 1989.

[4] Faugeras O., 'Three-Dimensional Computer Vision', MIT Press, 1993.

[5] Gold S. and Rangarajan A., 'A graduated assignment algorithm for graph matching', IEEE Transactions on PAMI, Vol. 18, No. 4, April 1996, pages 377-387.

[6] Gibbons A., 'Algorithmic Graph Theory', Cambridge University Press, 1985.

[7] Allen R. Hanson and Edward M. Riseman, 'Segmentation of Natural Scenes', in Computer Vision Systems (edited A. Hanson and E. Riseman), Academic Press, 1978, pages 129-164.

[8] Allen R. Hanson and Edward M. Riseman, 'VISIONS: A computer system for interpreting scenes', in Computer Vision Systems (edited A. Hanson and E. Riseman), Academic Press, 1978, pages 303-334.

[9] Glenn Healey and David Slater, 'Computing Illumination-Invariant Descriptors of Spatially Filtered Color Image Regions', IEEE Transactions on Image Processing, Vol. 6, No. 7, July 1997, pages 1002-1013.

[10] Lin F., 'A parallel computation network for the maximum clique problem', Proceedings 1993 International Symposium on Circuits and Systems, vol. 4, pp. 2549-2552, IEEE, May 1993.

[11] Pelillo M., 'Relaxation labeling networks that solve the maximum clique problem', Fourth International Conference on Artificial Neural Networks, pp. 166-170, published by IEE, June 1995.

[12] David Slater and Glenn Healey, 'Exploiting an Atmospheric Model for Automated Invariant Material Identification in Hyperspectral Imagery', preprint.

[13] Peter Tu, Tushar Saxena and Richard Hartley, 'Recognizing objects in cluttered images using subgraph isomorphism', to appear in Springer LNCS series, conference Palermo, Sicily, 1998.


Figure 7: Top: Extraction of roads from four-band imagery. In the four-band image (left) colors are quite muted (though this is not evident from this grey-scale image), and different surface types are hard to distinguish on the basis of 3-band color. Roads may be extracted from the four-band imagery (RGB + near infrared) based on the pixels' spectral content. A sampled region of road is analyzed and modelled as a 2-dimensional or (as here) 1-dimensional subspace passing through the origin. Pixels are then classified as road or non-road based on Mahalanobis distance from the classifying subspace. The image is then filtered to remove scattered wrongly-classified points. The roads stand out clearly in the processed image (right). Bottom: The same experiment on a larger-scale image. Pixels are classified as road/non-road. The agglomeration of roads in built-up areas is evident. Zooming in reveals details of the road structure.

Figure 8: This figure shows the trade-off between using a 1-dimensional (left) or a 2-dimensional (right) linear subspace to identify road pixels. The 2-dimensional classifier allows more natural lighting variability of the pixel color. This has the effect of picking up more slightly differently colored road regions, while at the same time allowing some non-road pixels to be wrongly classified.
