

Shape Recognition Using Vector Quantization

Antonella Di Lillo (Brandeis University, Waltham, MA), Giovanni Motta (Hewlett-Packard Corp., Cupertino, CA), James A. Storer (Brandeis University, Waltham, MA)

Abstract

We present a framework to recognize objects in images based on their silhouettes. In previous work we developed translation- and rotation-invariant classification algorithms for textures based on Fourier transforms in the polar space followed by dimensionality reduction. Here we present a new approach to recognizing shapes that follows a similar classification step with a "soft" retrieval algorithm, in which the search of a shape database is based on the VQ centroids found by the classification step. Experiments on the MPEG-7 CE-Shape-1 database show significant gains in retrieval accuracy over previous work. An interesting aspect of this recognition algorithm is that the first phase of classification appears to be a powerful tool for both texture and shape recognition.

1. Introduction

Perceptual properties such as color, texture, and object shape are attributes used to extract information from an image or video. The Moving Picture Experts Group (MPEG) has formally incorporated these concepts into its latest specification for a multimedia content description interface, MPEG-7.

Shape is an important visual feature; however, describing a shape is a difficult task for several reasons. Real-world objects are three dimensional, and when a 3-D object is projected onto a 2-D plane (as happens when taking a photograph), one dimension of the object is lost. Silhouettes extracted from a projection only partially represent that 3-D object and can dramatically change depending on the projection axis. Moreover, in contrast to color and texture, shapes are usually extracted after the image has been segmented into regions (for example, after separation of background and foreground), thus the contour of the shape is often corrupted by noise, arbitrary distortions, and occlusions.

Di Lillo, Motta, and Storer [3] presented FPFT, an algorithm that extracts low-level features invariant to certain geometric transformations to achieve high-performance classification and texture segmentation. Algorithms based on this feature extraction technique were tested on benchmarks from the Outex database [9], a reference database containing a large collection of problems composed of synthetic and natural images. The problems in the Outex database require the recognition of textures distorted by various naturally occurring transformations, such as rotation, scaling, and translation.

In this work, we further investigate the robustness of the FPFT algorithm by applying it to the recognition of silhouettes resulting from image segmentation. By using the same feature extraction technique, our method is able to capture the characteristics of the boundary contour and recognize the shape. Although FPFT was previously used for



texture classification and texture-based image segmentation, here it is re-tasked as the first phase of a highly effective shape recognition algorithm.

Section 2 reviews previous work, Section 3 presents our RBRC (Retrieval Based on Rotation-invariant Classification) algorithm, Section 4 presents experiments that compare this algorithm on the MPEG-7 CE-Shape-1 database to previous work in the literature, and Section 5 concludes.

2. Previous Work

Many ad-hoc shape representation techniques have been proposed and evaluated by measuring how precisely they retrieve similar shapes from a reference database. These techniques have been categorized into contour-based and region-based descriptors (Bober [2]).

Contour-based methods extract shape features from the object boundaries, which are crucial to human perception of shape similarity. These features can be extracted with either a continuous or a discrete approach. In a continuous approach, the feature vector is derived from the entire shape boundary, and the measure of shape similarity is either point-based or feature-based matching. A discrete approach breaks the shape boundary into segments, called primitives, using techniques such as polygonal approximation, curvature decomposition, or curve fitting; the resulting shape descriptor is usually a string or a graph, allowing the use of a similarity measure based on string or graph matching.

In contrast, region-based methods do not necessarily rely on boundaries, but instead extract shape features by looking at the shape as a whole. All pixels within a region are taken into account to obtain the shape representation. Since they extract features from internal and possibly boundary pixels, region-based methods can describe simple objects with or without holes as well as complex objects consisting of disconnected regions.

Many shape descriptors have been proposed, each with its advantages and disadvantages. Since it is believed that humans discriminate shapes based primarily on their boundaries, contour-based methods have often been favored over region-based methods. Techniques with which we have compared our algorithm include the following:

Curvature Scale Space (CSS) shape descriptors (Mokhtarian, Abbasi, and Kittler [8]) reduce the contours of a shape to sections of convex and concave curvature, mirroring a tendency commonly observed in human perception. The CSS technique determines the positions of the points at which the curvature is zero in order to delineate transitions between regions of convex and concave curvature. To achieve this, the shape boundary is analyzed at different scales, i.e., by filtering the contour with low-pass Gaussian filters of variable widths. To extract the CSS descriptors, the CSS contour map showing the multi-scale organization of the zero-curvature points must first be computed. An iterative process of filtering is conducted, each time at a new scale, until no new zero-curvature points are found. A final CSS contour map is assembled, containing all zero-curvature points. At this point, the CSS shape descriptors are formed by extracting the maxima from the CSS contour map, which are then listed in descending order. To determine the similarity of two shapes, the sum of the differences between matched peaks and the peak values of the unmatched peaks are measured.
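To make the CSS construction concrete, the following is a minimal sketch of ours (not the implementation of [8]): it smooths a closed contour with Gaussian filters of increasing width and records where the curvature changes sign. The contour representation (two coordinate arrays) and the scale schedule are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def css_map(x, y, sigmas):
    """For each scale sigma, smooth the closed contour (x, y) and return
    the arc-length indices where the curvature changes sign."""
    zero_sets = []
    for sigma in sigmas:
        xs = gaussian_filter1d(x, sigma, mode='wrap')  # closed curve: wrap around
        ys = gaussian_filter1d(y, sigma, mode='wrap')
        dx, dy = np.gradient(xs), np.gradient(ys)
        ddx, ddy = np.gradient(dx), np.gradient(dy)
        kappa = dx * ddy - dy * ddx  # curvature numerator; its sign suffices here
        sign_change = np.signbit(kappa) != np.signbit(np.roll(kappa, -1))
        zero_sets.append(np.flatnonzero(sign_change))
    return zero_sets  # the map is complete once no zero crossings remain
```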

Shape contexts (SC), proposed by Belongie, Malik, and Puzicha [1], are a correspondence-based shape matching technique that improves on the traditional Hausdorff distance. In correspondence-based shape matching, the shape contours are


sampled using a subset of points. Each sample point is used to attempt a point-to-point matching, and the degree to which this is possible is used to calculate similarity. The Hausdorff distance is a classic example of correspondence-based shape matching; it measures the maximum distance between nearest points in two sets. This does allow partial matches, but it is not invariant to rotation, translation, and scaling. The shape context descriptor represents a reference point around which the coarse distribution of the other points is specified. This specification is captured as a log-polar histogram, which has the side effect of increased sensitivity to the points closest to the center of the image. In determining the correspondence between shapes, sample points from both images are compared using their shape contexts, which is done using matrix-based matching.
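As an illustration of the log-polar histogram at the heart of this descriptor, the sketch below (our simplification; the bin counts and ranges are arbitrary choices, not those of [1]) computes one shape context for a chosen reference point, assuming distinct sample points in an (n, 2) array.

```python
import numpy as np

def shape_context(points, ref_idx, n_r=5, n_theta=12):
    """Histogram, in log-radius x angle bins, of the positions of all
    sampled contour points relative to points[ref_idx]."""
    v = np.delete(points, ref_idx, axis=0) - points[ref_idx]
    log_r = np.log(np.hypot(v[:, 0], v[:, 1]))       # log-radius of each point
    theta = np.arctan2(v[:, 1], v[:, 0]) % (2 * np.pi)
    hist, _, _ = np.histogram2d(
        log_r, theta, bins=(n_r, n_theta),
        range=((log_r.min(), log_r.max()), (0.0, 2 * np.pi)))
    return hist.ravel()  # one descriptor per reference point
```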

Inner-distance (ID), proposed by Ling and Jacobs [7], is instead a skeleton-based approach that takes interior points into consideration when building its descriptors. Starting with two chosen landmark points, the inner-distance is calculated as the shortest path between those points that remains within the shape boundary. This method extracts features that are insensitive to the articulation of the structural components of a complex shape while remaining highly sensitive to variations in the components themselves. The inner-distance differs from other skeleton-based approaches in that it does not keep a persistent record of the path structures after determining their lengths. The result is a low sensitivity to the boundary disturbances that typically penalize skeleton-based approaches, a characteristic that supports robust use of the inner-distance in shape descriptors.
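The sketch below approximates the inner-distance on a rasterized shape: a breadth-first search over the 4-connected pixels of a binary mask yields the shortest within-shape path length between two landmarks. This grid version is a simplification of ours; Ling and Jacobs [7] operate on contour landmark points via a visibility graph.

```python
from collections import deque
import numpy as np

def inner_distance(mask, p, q):
    """Shortest 4-connected path length from pixel p to pixel q that stays
    inside the binary mask; returns inf if the two points are disconnected."""
    h, w = mask.shape
    dist = np.full((h, w), np.inf)
    dist[p] = 0.0
    queue = deque([p])
    while queue:
        y, x = queue.popleft()
        if (y, x) == q:
            return dist[y, x]
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] \
                    and np.isinf(dist[ny, nx]):
                dist[ny, nx] = dist[y, x] + 1  # expand the BFS frontier
                queue.append((ny, nx))
    return np.inf
```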

3. The RBRC Algorithm

Di Lillo, Motta, and Storer [3] studied FPFT, a rotation-invariant feature extraction method, and tested it on the classification and segmentation of textured images. FPFT is used here as the first phase of an algorithm that recognizes silhouettes (obtained from image segmentation) by capturing the characteristics of the boundary contour and recognizing its shape.

The RBRC (Retrieval Based on Rotation-invariant Classification) algorithm presented here aims to answer the following question: Given a database of shapes, which shapes in the database are most similar to a given shape query? This problem is particularly important for applications of Content-Based Image Retrieval. To achieve satisfactory retrieval accuracy, the shape descriptor should be able to identify similar shapes that have been rotated, translated, and scaled, as well as shapes corrupted by noise and various distortions.

While Di Lillo, Motta, and Storer [3] show that FPFT is capable of extracting features that are invariant under rotation and translation, the peculiar distortions that affect silhouettes of 3-D objects are substantially different. This phenomenon is illustrated in Figure 1, which shows four sample shapes out of 20, selected from five of the 70 classes represented in the MPEG-7 CE-Shape-1 database (the database used in our experiments). Besides undergoing arbitrary scaling, translation, and rotation, shapes belonging to the same class may be captured from independent instances of an object, possibly containing unique variations that are hard to describe by means of a simple transformation.


Figure 1: Sample shapes from five classes (Apple, Bone, Camel, Device2, Fly) represented in the MPEG-7 CE-Shape-1 database.

Figure 2: Layout of the feature extractor.

3.1 Feature Vector Extraction

The first step of RBRC retrieval uses a version of the feature extraction technique presented in the FPFT method of Di Lillo, Motta, and Storer [3], as depicted in Figure 2. First, a two-dimensional Fourier transform is applied to the original image. The output of the transform produces a stage-2 image composed of the magnitudes of the Fourier coefficients. This step introduces invariance to translation, in case the shape is not completely centered in the window. The stage-2 image is then transformed into polar coordinates, producing a stage-3 image. The coordinates of the polar transform are defined relative to the center of the window, which is why the translation invariance


introduced by the Fourier transform is so important. With the image so transformed, a rotation of the input image produces an output image that is translated rather than rotated. The effect of this translation can be eliminated by again applying the translation-invariant Fourier transform, producing a stage-4 image. The two-dimensional result is then linearized and treated as a one-dimensional feature vector that is used as a shape descriptor. Because this feature vector may be large, in our experiments we have reduced its dimensionality using Fisher's discriminants, which maximize the separation between classes while minimizing the scatter within them. By measuring the discriminative power of each feature, the dimensions that do not help with the classification can be safely discarded.
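The pipeline of Figure 2 can be summarized in a few lines of numerical Python. This is a hedged sketch of the FPFT idea, assuming a square grayscale window; the polar grid resolution, interpolation order, and helper names are our illustrative choices, not the exact implementation of [3]. The per-feature Fisher ratio at the end is one standard way to measure the discriminative power referred to above.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def polar_resample(img, n_r=64, n_theta=128):
    """Resample a square image onto a (radius, angle) grid centered on the
    window center, so that a rotation becomes a shift along the angle axis."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radii = np.linspace(0.0, min(cy, cx), n_r)
    angles = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    r, t = np.meshgrid(radii, angles, indexing='ij')
    coords = np.stack((cy + r * np.sin(t), cx + r * np.cos(t)))
    return map_coordinates(img, coords, order=1, mode='nearest')

def fpft_features(window):
    """|FFT| -> polar map -> |FFT|, linearized into a 1-D shape descriptor."""
    stage2 = np.abs(np.fft.fftshift(np.fft.fft2(window)))  # translation invariant
    stage3 = polar_resample(stage2)                        # rotation -> shift
    stage4 = np.abs(np.fft.fft2(stage3))                   # shift removed
    return stage4.ravel()

def fisher_ratio(X, y):
    """Per-feature discriminative power: variance of the class means over the
    summed within-class variance; low-scoring dimensions can be dropped."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((X[y == c].mean(axis=0) - overall) ** 2 for c in classes)
    within = sum(X[y == c].var(axis=0) for c in classes)
    return between / (within + 1e-12)
```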

One major difference between how FPFT is used on textures [3] and how it is used here in the RBRC algorithm is that, in the case of textured images, features were extracted on a pixel-by-pixel basis; in the case of a shape, the FPFT is computed once for the whole input image.

3.2 Classification and Retrieval

Shape images are retrieved after they have been classified using a supervised approach, that is, a classifier that is first trained and then tested. Feature vectors are extracted as described in the previous section, one vector per sample image, and classified using a vector quantizer. Training extracts a small set of significant and discriminating features that characterize the properties of the training samples. The basis for the classification consists of a set of h features extracted from n training images, which depict shapes belonging to k different classes.

Training our classifier consists of finding, for each class, a small set of "typical" feature vectors that, during classification, will be compared to an unknown signature to determine the class to which it belongs. We employ a vector quantizer and, for each class, determine a small set of centroids. Centroids are computed independently for each class so that they minimize the mean squared error (MSE) against the feature vectors collected from that class's sample shapes.

To classify an unknown shape image, we extract its signature and compare it to the c × k centroids (c centroids for each of the k classes). The class associated with the centroid closest (in the MSE sense) to the signature is assigned to the shape. As will be seen in the next section, a small number of centroids per class (e.g., c = 4) is sufficient to fully characterize a class.
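A minimal sketch of the training and classification steps, assuming the per-class codebooks are built with a standard k-means quantizer (scipy's kmeans2 below; the paper does not prescribe a particular codebook design algorithm, and the function names are ours):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def train_codebooks(features, labels, c=4, seed=0):
    """For each class, find c centroids minimizing the MSE against the
    feature vectors extracted from that class's training shapes."""
    codebooks = {}
    for cls in np.unique(labels):
        X = features[labels == cls]                      # this class's vectors
        centroids, _ = kmeans2(X, c, minit='points', seed=seed)
        codebooks[cls] = centroids
    return codebooks

def classify(signature, codebooks):
    """Assign the class whose nearest centroid has minimum MSE."""
    def nearest_mse(cents):
        return np.min(((cents - signature) ** 2).mean(axis=1))
    return min(codebooks, key=lambda cls: nearest_mse(codebooks[cls]))
```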

4. Experimental Results

This section describes experiments that test the RBRC algorithm. The database used in our experiments is the widely used MPEG-7 CE-Shape-1 [6]. It consists of 1,400 silhouette images divided into 70 classes, each containing 20 images; Figure 3 shows a representative sample for each of the 70 classes.

This database presents a challenge because shapes may not only be rotated but may also contain distortions such as occlusion and deformation. Moreover, some classes contain objects whose shapes are significantly different, an issue that is evident in the 20 images from 5 classes displayed in Figure 1.


Figure 3: A sample for each of the 70 classes in the MPEG-7 CE-Shape-1 database.

Three basic sets of experiments have been performed, each over a range of centroid counts and each computed with a slightly different method: classification, simple retrieval, and bull's eye retrieval.

Classification: The classification can be seen as an immediate application of vector quantization. After collecting the signatures of all shapes in a given class, a set of c centroids is determined.


Figure 4: Classification, retrieval, and bull's eye retrieval accuracy (%) as a function of the number of centroids (1, 2, 4, 8, 16). All three curves show the net accuracy resulting from querying each image of the database on the entire database.

The process is repeated for each of the 70 classes in the database. The signature of each shape is then compared to the c × 70 centroids, and the class associated with the closest centroid is assigned to the shape. The number of correct classifications constitutes the final score.

Simple Retrieval: The simple retrieval simulates the behavior of a system that, given an input query image, accesses a database of images to retrieve all images belonging to the same class. After determining the c × 70 centroids as described in the section above, each of the 1,400 images is used in turn to query the database. First, its class is determined as described in classification, then all images in the database that have been classified in the same class are retrieved and counted. The total number of images correctly retrieved represents the final score.
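Continuing the sketch above (with the same hypothetical helper names), simple retrieval reduces to classifying the query and returning every database image that received the same label; `classify` and the codebooks are those defined in Section 3.2's sketch.

```python
def simple_retrieval(query_sig, db_sigs, codebooks):
    """Indices of all database images assigned to the query's class."""
    q_cls = classify(query_sig, codebooks)
    return [i for i, s in enumerate(db_sigs)
            if classify(s, codebooks) == q_cls]
```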

Bull's Eye Retrieval: Much of the existing literature reports retrieval results with a method called the bull's eye score, which exemplifies the most typical use of a system that retrieves images based on their content. Given a query image, the N most similar images are retrieved from the database and the number of images belonging to the same class as the query is counted. Previous work by others on the MPEG-7 CE-Shape-1 database, to which we compare ours, used N = 40. Here, the c × 70 centroids are determined as previously described. Then each of the 1,400 images is used to query the database. The class of the query image is determined first, then the c centroids corresponding to this class (whether it is correct or not) are used to retrieve the 40 images having minimum squared error with one of the c centroids. Finally, the number of images belonging to the original class of the query image is counted.
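A sketch of the bull's eye procedure under the same assumptions as before (numpy arrays for signatures and labels; `classify` and the codebooks come from the earlier sketch):

```python
import numpy as np

def bulls_eye_hits(query_idx, sigs, labels, codebooks, N=40):
    """Retrieve the N signatures with minimum MSE to any centroid of the
    query's assigned class; count how many share the query's true class."""
    cents = codebooks[classify(sigs[query_idx], codebooks)]
    d = np.array([np.min(((cents - s) ** 2).mean(axis=1)) for s in sigs])
    top = np.argsort(d)[:N]                      # N closest database images
    return int(np.sum(labels[top] == labels[query_idx]))
```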


Table I: Bull's eye retrieval for MPEG-7 CE-Shape-1.

Method                    Score
CSS [8]                   75.44%
Visual Parts [6]          76.45%
SC+TPS [1]                76.51%
Curve Edit [10]           78.71%
Distance Set [4]          78.38%
MCSS [5]                  78.80%
Generative Models [11]    80.03%
MDS+SC+DP [7]             84.35%
IDSC+DP [7]               85.40%
RBRC, 4 centroids         91.79%

Results for these three methods are reported in Figure 4; all three curves show the net accuracy resulting from querying each image of the database on the entire database. The scores of the bull’s eye retrieval are compared with results in the existing literature in Table I. For c = 4 centroids, our method significantly improves upon the best result previously reported in the literature (91.8% as compared to 85.4%). It is noteworthy that the improvement achieved by the RBRC algorithm is based on feature extraction that has been used to improve the state-of-the-art in both texture segmentation and texture classification. That is, FPFT proves to be a powerful representation of image features that can be adapted as a key component of RBRC to address very different problems.

The bull's eye measure uses the parameter N = 40; that is, out of the top 40 images retrieved, it counts how many are in the correct class. The choice of N = 40 is natural for the MPEG-7 CE-Shape-1 database because it is twice the number of shapes in each class, allowing a reasonable but not overly large number of images returned for a given query (40) as compared to the maximum possible number of correct answers for a given query (20). Although N = 40 is the standard used in past work to which we compare our results, it is worth noting that the performance of RBRC changes gracefully with N; in fact, even at N = 20, RBRC roughly equals the best of the past results shown in Table I, which use N = 40. Table II shows these results, with the top row giving the value of N and the bottom row the percentage of correct retrievals among the top N.


Table II: RBRC retrieval for MPEG-7 CE-Shape-1 using 4 centroids, for different values of N in the bull's eye measure.

N        20       25       30       35       40       45       50       55       60
Score    84.60%   87.21%   88.66%   90.28%   91.79%   91.96%   92.62%   93.10%   93.78%

Table III: Classification and retrieval on half of the database (4 centroids).

Train            1st half   1st half   2nd half   2nd half
Test             1st half   2nd half   1st half   2nd half
Classification   93.57%     62.14%     62.29%     91.00%
Retrieval        89.11%     54.69%     55.09%     85.29%
Bull's Eye       94.47%     94.30%     92.06%     91.09%

Separation of Training from Testing

The experiments presented thus far have queried each image in the database against all others in order to compute statistics, which allows us to compare our results directly to those reported in the literature. However, it is useful to ask how training and testing the system on the same sample shapes affects the results of classification and retrieval. Table III presents experiments that address this question. In these experiments, MPEG-7 CE-Shape-1 is divided into two halves, each containing 10 shapes for each of the 70 classes. The first half contains the shapes numbered 1 to 10, the second the shapes numbered 11 to 20. Results for classification, retrieval, and bull's eye retrieval are obtained by first computing the centroids on one half of the database and then testing the method on the other half. Table III reports both training on the first half with testing on the second, and vice-versa.

As can be seen from Table III, the bull’s eye retrieval is basically unaffected when using only half of the database for training (still over 91%), which is further evidence of the robustness of RBRC. Performance of classification and simple retrieval, however, is greatly reduced.

5. Conclusion

We have presented a framework that can be used to recognize objects based on their silhouettes. It has been tested on the MPEG-7 CE-Shape-1 database, and when used in bull's eye retrieval it significantly outperforms previous methods. The RBRC algorithm is a robust way to exploit the FPFT feature extraction algorithm of our previous work on texture classification and segmentation (Di Lillo, Motta, and Storer [3]). This result is particularly noteworthy since the state of the art in feature extraction has traditionally been achieved with statistical or ad-hoc models, while frequency-based models have often been assumed to be inferior.


References

[1] S. Belongie, J. Malik, and J. Puzicha. "Shape Matching and Object Recognition Using Shape Contexts". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509-522, 2002.

[2] M. Bober. "MPEG-7 Visual Shape Descriptors". IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 716-719, 2001.

[3] A. Di Lillo, G. Motta, and J. A. Storer. "Multiresolution Rotation-Invariant Texture Classification Using Feature Extraction in the Frequency Domain and Vector Quantization". Proc. Data Compression Conference (DCC 2008), pp. 452-461, 2008.

[4] C. Grigorescu and N. Petkov. "Distance Sets for Shape Filters and Shape Recognition". IEEE Transactions on Image Processing, vol. 12, no. 10, pp. 1274-1286, 2003.

[5] A. C. Jalba, M. H. F. Wilkinson, and J. B. T. M. Roerdink. "Shape Representation and Recognition through Morphological Curvature Scale Spaces". IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 331-341, 2006.

[6] L. J. Latecki, R. Lakamper, and U. Eckhardt. "Shape Descriptors for Non-Rigid Shapes with a Single Closed Contour". Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. I, pp. 424-429, 2000.

[7] H. Ling and D. W. Jacobs. "Shape Classification Using the Inner-Distance". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 286-299, 2007.

[8] F. Mokhtarian, S. Abbasi, and J. Kittler. "Efficient and Robust Retrieval by Shape Content through Curvature Scale Space". Image Databases and Multi-Media Search, A. W. M. Smeulders and R. Jain, eds., pp. 51-58, World Scientific, 1997.

[9] T. Ojala, T. Mäenpää, M. Pietikäinen, J. Viertola, J. Kyllönen, and S. Huovinen. "Outex - New Framework for Empirical Evaluation of Texture Analysis Algorithms". Proc. 16th Int. Conf. on Pattern Recognition, vol. 1, pp. 701-706, 2002.

[10] T. Sebastian, P. Klein, and B. Kimia. "On Aligning Curves". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 1, pp. 116-125, 2003.

[11] Z. Tu and A. L. Yuille. "Shape Matching and Recognition Using Generative Models and Informative Features". Proc. European Conf. on Computer Vision, vol. 3, pp. 195-209, 2004.
