
Edge-based Template Matching and Tracking for Perspectively Distorted Planar Objects

Andreas Hofhauser, Carsten Steger, and Nassir Navab

TU München, Boltzmannstr. 3, 85748 Garching bei München, Germany
MVTec Software GmbH, Neherstr. 1, 81675 München, Germany

Abstract. This paper presents a template matching approach to high-accuracy detection and tracking of perspectively distorted objects. To this end we propose a robust match metric that allows significant perspective shape changes. Using a coarse-to-fine representation for the detection of the template further increases efficiency. Once a template is detected at interactive frame rate, we immediately switch to tracking with the same algorithm, enabling detection times of only 20 ms. We show in a number of experiments that the presented approach is not only fast, but also very robust and highly accurate in detecting the 3D pose of planar objects or planar subparts of non-planar objects. The approach is used in augmented reality applications that until now could not be solved satisfactorily because existing approaches either needed extensive training data, like machine learning methods, or relied on interest point extraction, like descriptor-based methods.

1 Introduction

Methods that exhaustively search for a template in an image are among the oldest computer vision algorithms used to detect an object. However, the mainstream vision community has abandoned the idea of an exhaustive search. Two prejudices are commonly articulated: first, that object detection based on template matching is slow, and second, that it is extremely inefficient for, e.g., perspective distortions, where an 8-dimensional search space must be evaluated. In our work, we address these issues and show that with several contributions it is possible to benefit from the robustness and accuracy of template matching even when an object is perspectively distorted. Furthermore, we show in a number of experiments that it is possible to achieve an interactive rate of detection that until now was only possible with descriptor-based approaches. In fact, if the overall search range of the pattern is restricted, real-time detection is possible on current standard PC hardware. Furthermore, as we have an explicit representation of the geometric search space, we can easily restrict it and therefore use the proposed method for high-speed tracking. This is in strong contrast to many other approaches (e.g., [1]), in which a difficult decision has to be made about when to switch between a detection and a tracking algorithm. One of the key contributions of any new template matching algorithm is the image metric that is used to compare the model template with the image content. The design of this metric determines the overall behavior, and its evaluation dominates the run-time. One key contribution of the proposed metric is that we explicitly distinguish between model contours that give us point correspondences and contours that suffer from the aperture problem. This allows us to detect, e.g., an assembly part that contains only curved contours. For these kinds of objects, descriptor-based methods, which excel at allowing perspective shape changes, notoriously fail if the image contains too little texture or only a small set of repetitive texture, as in Figure 1.

Fig. 1. An object typically encountered in assembly scenarios. A planar sub-part of a non-planar object is taken as the model region, and the detection results are depicted as the white contours. The different views lead to significant non-linear illumination changes and perspective distortion. The object contains only curved contours, and hence the extraction of discriminative point features is a difficult task.

1.1 Related Work

We roughly classify algorithms for pose detection into template matching and descriptor-based methods. In the descriptor-based category, the rough scheme is to first determine discriminative "high-level" features, extract surrounding discriminative descriptors from these feature points, and establish the correspondence between model and search image by classifying the descriptors. The big advantage of this scheme is that the run-time of the algorithm is independent of the dimensionality of the geometric search space. Recent prominent examples that fall into this category are [2-5]. While showing outstanding performance in several scenarios, they fail if the object has only highly repetitive texture or only sparse edge information: the feature descriptors then overlap in the feature space and are no longer discriminating.

In the template matching category, we subsume algorithms that perform an explicit search. Here, a similarity measure that is either based on intensities (like SAD, SSD, NCC, and mutual information) or on gradient features is evaluated. However, the evaluation of intensity-based metrics is computationally expensive. Additionally, they are typically not invariant to nonlinear illumination changes, clutter, or occlusion.


For the case of feature-based template matching, only a sparse set of features between template and search image is compared. While extremely fast and robust if the object undergoes only rigid transformations, these methods become intractable for a large number of degrees of freedom, e.g., when an object is allowed to deform perspectively. Nevertheless, one approach for feature-based deformable template matching is presented in [6], where the final template is chosen from a learning set while the match metric is evaluated. Because obtaining a learning set and applying a learning step is problematic for our applications, we prefer not to rely on training data except for the original template. In contrast to this, we use a match metric that allows for local perspective deformations, while preserving robustness to illumination changes, partial occlusion, and clutter. While we found a match metric with normalized directed edge points in [7, 8] for rigid object detection, and also for articulated object detection in [9], its adaptation to 3D object detection is new. A novelty of the proposed approach is that the search method takes the search results for all parts into account at the same time. Although the model is decomposed into sub-parts, the relevant size of the model that is used for the search at the highest pyramid level is not reduced. Hence, the presented method does not suffer from the speed limitations of a reduced number of pyramid levels that prior methods have. This is in contrast to, e.g., a component-based detection like [9], which could conceptually also be adapted for perspective object detection: there, small sub-parts must be detected, leading to a lower number of pyramid levels that can be used to speed up the search.

2 Perspective Shape-Based Object Detection

In the following, we detail the perspective shape-based model generation and matching algorithm. The problem that this algorithm solves is particularly difficult because, in contrast to optical flow, tracking, or medical registration, we assume neither temporal nor local coherence. While the locations of the objects are determined with the robustness of a template matching method, we avoid the necessity of expanding the full search space as if it were a descriptor-based method.

2.1 Shape Model Generation

For the generation of our model, we decided to rely on the result of a simple contour edge detection. This allows us to represent objects from template images as long as there is any intensity change. Note that in contrast to corners or interest point features, we can model objects that contain only curved contours (see detection results in Figure 1). Furthermore, directly generating a model from an untextured CAD format is possible in principle. Our shape model $M$ is composed of an unordered set of edge points

$$M = \{\,x_i, y_i, d_i^m, c_{ji}, p_j \mid i = 1 \ldots n,\ j = 1 \ldots k\,\}. \qquad (1)$$


Here, $x_i$ and $y_i$ are the row and column coordinates of the $n$ model points, and $d_i^m$ denotes the gradient direction vector at the respective row and column coordinate of the template. We assume that spatially coherent structures stay the same even after a perspective distortion. Therefore, we cluster the model points with expectation-maximization-based k-means such that every model point belongs to one of $k$ clusters. The indicator matrix $c_{ji}$ maps clusters to model points (entry 0 if the point is not in the cluster, entry 1 otherwise) and allows us to access the model points of each cluster efficiently at run-time. For the later detection we have to distinguish whether a cluster of the model can be used as a point feature (which gives us two equations, for the x and y location) or only as a contour line feature (which suffers from the aperture problem and gives only one equation). This label is saved for each cluster in $p_j$. We detect it by analyzing whether a cluster contains only one dominant gradient direction or gradient directions in various, e.g., perpendicular, directions. For this we determine a feature that describes the main gradient direction for each of the $k$ clusters of the model:

$$\mathrm{ClusterDirection}_j = \left\|\frac{\sum_{i=1}^{n}\frac{d_i^m}{\|d_i^m\|}\,c_{ji}}{\sum_{i=1}^{n} c_{ji}}\right\|. \qquad (2)$$

If a model part contains direction vectors in only one dominant direction, the length of the resulting ClusterDirection vector is one or, in the case of noise, slightly below one. In all other cases, e.g., a corner-like shape or opposing gradient directions, the length of ClusterDirection is significantly smaller than one. For a straight edge with a contrast change, the sign change of the gradient polarity gives us an additional constraint that prevents movement of the cluster along the edge, and therefore we assign it a point feature label. Because the model generation relies on only one image and the calculation of the clusters is realized efficiently, this step takes less than a second even for models with thousands of points. This is an advantage for users of a computer vision system, as an extended offline phase would make the use of a model generation algorithm cumbersome.
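For concreteness, the offline phase might be prototyped as follows. This is a minimal sketch in Python/NumPy, assuming edge points and gradient vectors have already been extracted (e.g., with a Sobel filter); it uses SciPy's plain k-means in place of the EM-based k-means from the paper, and the function name build_shape_model as well as the 0.9 threshold are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # plain k-means; the paper uses EM-based k-means

def build_shape_model(points, gradients, k, direction_threshold=0.9):
    """Build the shape model M of Eq. (1) from edge points.

    points:    (n, 2) array of (row, column) edge point coordinates (x_i, y_i)
    gradients: (n, 2) array of gradient direction vectors d_i^m
    k:         number of clusters
    Returns the points, unit gradient directions, cluster labels, and a
    per-cluster flag p_j: True for a point feature, False for a line feature.
    """
    # Normalize the gradient directions to unit length.
    d_m = gradients / np.linalg.norm(gradients, axis=1, keepdims=True)

    # Cluster spatially coherent edge points (Sec. 2.1).
    _, labels = kmeans2(points.astype(float), k, minit='++', seed=0)

    # Eq. (2): length of the mean unit gradient direction per cluster.
    is_point_feature = np.empty(k, dtype=bool)
    for j in range(k):
        cluster_direction = np.linalg.norm(d_m[labels == j].mean(axis=0))
        # Close to 1: one dominant direction, i.e. a line feature that
        # suffers from the aperture problem; clearly below 1: point feature.
        is_point_feature[j] = cluster_direction < direction_threshold
    return points, d_m, labels, is_point_feature
```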

2.2 Metric Based on Local Edge Patches

Given the generated model, the task of the perspective shape matching algorithm is to extract instances of that model in new images. For this, we adapted the match metric of [8]. This match metric is designed such that it is inherently invariant to nonlinear illumination changes, partial occlusion, and clutter. If the location of an object is described by $x, y$ (this is 2D translation only, but the formulas can easily be extended to, e.g., 2D affine transformations), the score function for rigid objects reads as follows:

$$s(x,y) = \frac{1}{n}\sum_{i=1}^{n} \frac{\langle d_i^m,\, d_{(x+x_i,\,y+y_i)}^s\rangle}{\|d_i^m\|\cdot\|d_{(x+x_i,\,y+y_i)}^s\|}, \qquad (3)$$

where $d^s$ is the direction vector in the search image, $\langle\cdot,\cdot\rangle$ is the dot product, and $\|\cdot\|$ is the Euclidean norm. The point set of the model is compared to a dense gradient direction field of the search image. Even with significant nonlinear illumination changes that propagate to the gradient amplitude, the gradient direction stays the same. Furthermore, a hysteresis threshold or non-maximum suppression is completely avoided in the search image, resulting in true invariance to arbitrary illumination changes.¹ Partial occlusion, noise, and clutter result in random gradient directions in the search image. These effects lower the maximum of the score function but do not alter its location. Hence, the semantic meaning of the score value is the ratio of matching model points. It is interesting to note that comparing the cosine between the gradients leads to the same result, but calculating this formula with dot products is several orders of magnitude faster.
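For illustration, Eq. (3) can be evaluated directly on precomputed gradient images. The sketch below assumes that $x_i$ are row offsets and $y_i$ column offsets (as in Sec. 2.1) and omits image bounds checking; the function name rigid_score is ours.

```python
import numpy as np

def rigid_score(model_pts, model_dirs, grad_r, grad_c, x, y, eps=1e-9):
    """Eq. (3): mean normalized dot product between the model gradient
    directions and the search-image gradients at translation (x, y).

    model_pts:  (n, 2) integer (row, column) model point offsets (x_i, y_i)
    model_dirs: (n, 2) unit model gradient direction vectors d_i^m
    grad_r, grad_c: dense row/column gradient images of the search image
    """
    rows = model_pts[:, 0] + x
    cols = model_pts[:, 1] + y
    d_s = np.stack([grad_r[rows, cols], grad_c[rows, cols]], axis=1)
    norms = np.linalg.norm(d_s, axis=1) + eps   # avoid division by zero
    # Occlusion and clutter yield random directions whose contributions
    # average out; they lower the score but do not shift its maximum.
    return float(np.mean(np.sum(model_dirs * d_s, axis=1) / norms))
```

Since occluded or cluttered points contribute roughly zero on average, the returned value can be read as the fraction of model points that found a matching gradient direction.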

The idea behind extending this metric to 3D object detection is that we globally instantiate only similarity transformations. By allowing successive small movements of the parts of the model, we implicitly evaluate a much higher class of nonlinear transformations, like perspective distortions. Following this argument, we distinguish between an explicit global score function $s_g$, which is evaluated for, e.g., affine 2D transformations², and a local implicit score function $s_l$, which allows for local deformations. The global score function $s_g$ is the sum of the contributions of all clusters:

$$s_g(x,y) = \frac{1}{n}\sum_{j=1}^{k} s_l(x,y,j). \qquad (4)$$

We assume that even after a perspective distortion the neighborhood of each model point stays the same and is approximated by a local Euclidean transformation. Hence, we instantiate local Euclidean transformations $T$ for each cluster and apply them to the model points of that cluster in a small local neighborhood. If the cluster is a point feature, we search a 5×5 pixel window for the optimal score. In the case of a line feature, we search 2 pixels in both directions along $\mathrm{ClusterDirection}_j$ of the respective cluster. The local score then is the maximum alignment of gradient direction between the locally transformed model points of each cluster and the search image. Accordingly, the proposed local score function $s_l$ is:

$$s_l(x,y,j) = \max_{T}\sum_{i=1}^{\mathrm{size}(j)} \frac{\langle d_{c_{ji}}^m,\, d_{(x+T(x_{c_{ji}}),\,y+T(y_{c_{ji}}))}^s\rangle}{\|d_{c_{ji}}^m\|\cdot\|d_{(x+T(x_{c_{ji}}),\,y+T(y_{c_{ji}}))}^s\|} \qquad (5)$$

Here, the function size returns the number of elements in cluster $j$. For the sake of efficiency, we exploit the mapping that was generated in the offline phase for accessing the points in each cluster (the $c_{ji}$ matrix). Furthermore, we cache $T(x_{c_{ji}})$ and $T(y_{c_{ji}})$, since they are independent of $x$ and $y$.

¹ Homogeneous regions and regions below a minimum contrast change (e.g., less than 3 gray values) can optionally be discarded, as they give random directions that are due to noise. However, this is not needed conceptually; it merely gives a small speedup.

² For the sake of clarity, we write the formulas only for 2D translation. They can easily be extended to, e.g., 2D rotation, scaling, and anisotropic scaling, as is done in our implementation.
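Under the same assumptions as the sketch above (row/column gradient images, no bounds checking), Eqs. (4) and (5) might be prototyped as follows. Restricting the local transformations $T$ to pure translations of a cluster is a simplification of the local Euclidean transformations described in the text.

```python
import numpy as np

def local_score(pts, dirs, grad_r, grad_c, x, y, is_point, cdir, eps=1e-9):
    """Eq. (5) for one cluster: best gradient alignment over small shifts.

    is_point: point features search a 5x5 window; line features search
              +-2 pixels along the cluster direction cdir (Sec. 2.2).
    Returns the best summed score and the corresponding displacement.
    """
    if is_point:
        shifts = [(dr, dc) for dr in range(-2, 3) for dc in range(-2, 3)]
    else:
        shifts = [(round(t * cdir[0]), round(t * cdir[1])) for t in range(-2, 3)]
    best, best_shift = -np.inf, (0, 0)
    for dr, dc in shifts:
        rows = pts[:, 0] + x + dr
        cols = pts[:, 1] + y + dc
        d_s = np.stack([grad_r[rows, cols], grad_c[rows, cols]], axis=1)
        norms = np.linalg.norm(d_s, axis=1) + eps
        s = float(np.sum(np.sum(dirs * d_s, axis=1) / norms))
        if s > best:
            best, best_shift = s, (dr, dc)
    return best, best_shift

def global_score(clusters, grad_r, grad_c, x, y, n):
    """Eq. (4): sum of the local cluster scores over all n model points.

    clusters: list of (pts, dirs, is_point, cdir) tuples, one per cluster.
    """
    total = 0.0
    for pts, dirs, is_point, cdir in clusters:
        total += local_score(pts, dirs, grad_r, grad_c, x, y, is_point, cdir)[0]
    return total / n
```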


2.3 Perspective Shape Matching

After defining an efficient score function that tolerates shape changes, we integrated it into a general-purpose object detection system. We decided to alter the conventional template matching algorithm such that it copes with perspectively distorted planar 3D objects.

Fig. 2. Schematic depiction of the shape matching algorithm. The original model consists of the rectangle. The first distorted quadrilateral is derived from the parent of the hypothesis. The local displacements T, depicted as arrows, bring the warped template to a displaced position, and the fitted homography aligns the parts of the model again.

Hence, the perspective shape matching algorithm first extracts an image pyramid of incrementally zoomed-down versions of the original search image. At the highest pyramid level, we extract the local maxima of the score function $s_g$ (4). The local maxima of $s_g$ are then tracked through the image pyramid until either the lowest pyramid level is reached or no match candidate is above a certain score value. While tracking the candidates down the pyramid, a rough alignment has already been extracted during the evaluation of the current candidate's parent on a higher pyramid level. Therefore, we first use the alignment originating from the candidate's parent to warp the model. Then, starting from this warped candidate, the local transformations $T$ that give the maximal score yield a locally optimal displacement of each cluster in the image. Since we are interested in perspective distortions of the whole object, we fit a homography with the normalized DLT algorithm [10] that maps the locations of the original cluster centers to the displaced cluster centers given by $T$. Depending on whether a cluster has been classified as a point feature or a line feature, each correspondence gives two equations or one equation in the DLT matrix.³ Then we iteratively warp the whole model with the extracted update and refine the homography on each pyramid level until the update becomes close to the identity or a maximal number of iterations is reached. Up to now, the displacement of each part $T$ is discretized at pixel resolution. However, as the total pose of the object is determined by the displacements of many clusters, we typically obtain a very precise position of the object. To reach the high accuracy and precision that is a requirement in many computer vision applications, we extract subpixel-precise edge points in the search image and determine correspondences for each model point. Given the subpixel-precise correspondences, the homography is again iteratively updated until convergence. Here, we minimize the distance between the tangent of each model point and the corresponding subpixel-precise edge point in the image. Hence, each model edge point gives us one equation, since it is a line feature.

³ Typically, the DLT equations are highly overdetermined. We evaluated a solution with the SVD as described in [10] and one computed directly with the eigenvalue decomposition of the normal equations. Because there is no noticeable difference in robustness, despite the expected quadratically worse condition number of the normal equations, and because the eigenvalue decomposition is faster, we use the solution with the normal equations in the following discussion. To give an impression of the difference: in one example sequence, the whole object detection runs in 150 ms with the SVD, whereas with the eigenvalue decomposition it runs in 50 ms.
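A minimal normalized DLT fit from the point-feature correspondences alone could look like the sketch below, following [10]. The mixed one- and two-equation system for line and point features, as well as the faster normal-equation variant from footnote 3, are omitted for brevity; fit_homography_dlt is an illustrative name.

```python
import numpy as np

def fit_homography_dlt(src, dst):
    """Normalized DLT [10] from point correspondences.

    src, dst: (m, 2) arrays of corresponding points, m >= 4.
    Returns the 3x3 homography H with dst ~ H * src (homogeneous).
    """
    def normalize(pts):
        # Translate centroid to the origin, scale mean distance to sqrt(2).
        mean = pts.mean(axis=0)
        scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
        T = np.array([[scale, 0, -scale * mean[0]],
                      [0, scale, -scale * mean[1]],
                      [0, 0, 1.0]])
        ph = np.column_stack([pts, np.ones(len(pts))]) @ T.T
        return ph, T

    s, Ts = normalize(np.asarray(src, dtype=float))
    d, Td = normalize(np.asarray(dst, dtype=float))
    A = []
    for (x, y, _), (u, v, _) in zip(s, d):
        # Two equations per point correspondence.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # Null vector of A = right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    # Undo the normalization.
    H = np.linalg.inv(Td) @ H @ Ts
    return H / H[2, 2]
```

In the matching loop, src would hold the original cluster centers and dst the centers displaced by the best local shifts $T$; the fit is repeated on each pyramid level until the update is close to the identity.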

2.4 Perspective Shape Tracking

Once a template is found in an image sequence, we restrict the search space for the subsequent frames. This is valid in, e.g., many augmented reality scenarios, in which the inter-frame rotation can be assumed to be less than 45 degrees and the scale change of the object to be between 0.8 and 1.2. Once a track is lost, we switch back to detection by expanding the search range to the original size. It is important to note that we use the same algorithm for tracking and detection and only parametrize it differently. Hence, we exploit prior knowledge to speed up the search if it is available. However, our template matching is not restricted to tracking alone, like, e.g., [11], but can be used for object detection when needed.
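The switch between detection and tracking is thus purely a matter of parametrization. The following sketch illustrates the control flow; the matcher object and its find method are hypothetical, and only the 45-degree rotation bound and the 0.8-1.2 scale range are taken from the text above (the detection ranges are illustrative).

```python
# Illustrative parametrization only; names and detection ranges are not
# from the paper.
DETECT_PARAMS = dict(rotation_range=(-180.0, 180.0),  # full search space
                     scale_range=(0.5, 2.0),
                     pyramid_levels=5)

TRACK_PARAMS = dict(rotation_range=(-45.0, 45.0),     # around the last pose
                    scale_range=(0.8, 1.2),
                    pyramid_levels=2)

def process_frame(frame, matcher, last_pose):
    """Same matcher for detection and tracking; only the parameters differ."""
    if last_pose is not None:
        match = matcher.find(frame, around=last_pose, **TRACK_PARAMS)
        if match is not None:
            return match
    # Track lost (or first frame): fall back to full detection.
    return matcher.find(frame, around=None, **DETECT_PARAMS)
```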

3 Experiments

For the evaluation of the proposed object detection and tracking algorithm, we conducted experiments under various real-world conditions.

3.1 Benchmark Test Set

For comparison with other approaches, we used sample images from publicly available benchmark datasets⁴ (see Figure 3). The graffiti sequence is used, for instance, to evaluate how much perspective distortion a descriptor-based approach can tolerate. The last two depicted images are a challenge for many descriptor-based approaches.

Another interesting comparison is with the phone test sequence provided in [12]. Here, Lucas-Kanade, SIFT, and the method of [12] are reported to sometimes lose the object. The proposed algorithm is able to process the sequence at 60 ms per image without losing the object once. We attribute this to the fact that we explicitly represent contour information, not just interest point features, and that we are able to search for the object exhaustively.

⁴ http://www.robots.ox.ac.uk/~vgg/data/data-aff.html


Fig. 3. Benchmark data set with the detected template, used for the evaluation of the method.

Fig. 4. Sample images used for the different real-world experimental evaluations. The fitted edge model points are depicted.

3.2 Industrial Robot Experiments

To evaluate the accuracy of the 3D object detection, we equipped a six-axis industrial robot with a calibrated camera (12 mm lens, 640 × 480 pixels) mounted at its gripper. Then we taught the robot a sequence of different poses in which the camera-object distance changes in the range of 30-50 cm and significant latitude changes must be compensated. First, we determined the repeatability of the robot, to rule out a drift between different experiment runs. To this end, we made the robot drive the sequence several times and determined the pose of the camera in the different runs with an industrial calibration grid seen by the camera. The maximal difference between the poses of different runs was a normalized translational distance error of 0.0009 between estimated and ground-truth position and an angle of 0.04 degrees for the rotation that brings the estimated onto the true pose. Then, we manually placed an object at the same place as the calibration grid and used the planar shape matching with a bundle-adjustment pose estimation to determine the pose of the object (see Figure 4). The maximal difference to the poses that were extracted with the calibration grid was below 0.01 normalized distance and 0.4 degrees. It is interesting to note that the remaining pose error is dominated by the depth uncertainty. The maximal errors occur when the images suffer severe illumination changes or are out of focus. When these situations are prevented, the translation error is below 0.005 normalized distance and the angle error below 0.2 degrees.
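As a reading aid, the two error measures used above can be computed as follows; that the normalized distance divides the translation error by the distance to the object is our assumption, since the text does not define it explicitly.

```python
import numpy as np

def pose_errors(R_est, t_est, R_true, t_true):
    """Rotation/translation errors between two poses (R, t).

    Returns the angle (degrees) of the residual rotation that brings the
    estimated rotation onto the true one, and the translation error divided
    by the distance to the object (assumed 'normalized distance').
    """
    dR = R_true @ R_est.T
    cos_angle = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    norm_dist = (np.linalg.norm(np.asarray(t_est) - np.asarray(t_true))
                 / np.linalg.norm(t_true))
    return angle_deg, norm_dist
```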


Fig. 5. Sample images from a longer sequence used in the experimental evaluation. The first image is used for generating the template of the house. Further movies can be viewed on the web site: http://campar.in.tum.de/Main/AndreasHofhauser.

3.3 Registering an Untextured Building

To evaluate the method for augmented reality scenarios, we used the proposed approach to estimate the position of a building imaged by a shaking hand-held camera (see Figure 5). Despite huge scale changes, motion blur, and parallax effects, the facade of the building is robustly detected and tracked. We think that this is a typical sequence for mobile augmented reality applications, e.g., the remote expert scenario. Furthermore, due to the fast model generation, the approach is particularly useful for, e.g., mobile navigation, in which the template for object detection must be generated instantly.

To sum up the experimental evaluation, we find the results very encouraging in terms of speed, robustness, and accuracy compared to other methods published in the literature. Applications that up to now relied on interest points can replace their current object detection without the disadvantage of reduced robustness with regard to, e.g., attitude changes. Furthermore, these applications can benefit from greater robustness and accuracy, particularly when detecting untextured objects. Since the acquisition of the test sequences was a time-consuming task, and to stimulate further research and comparisons, we will make the test data available upon request.

4 Conclusion

In this paper, we presented a single-camera solution for planar 3D template matching and tracking that can be utilized in a wide range of applications. To this end, we extended an existing edge-polarity-based match metric to tolerate local shape changes. In an extensive evaluation, we showed the applicability of the method in several computer vision scenarios.


References

1. Ladikos, A., Benhimane, S., Navab, N.: A real-time tracking system combining template-based and feature-based approaches. In: International Conference on Computer Vision Theory and Applications. (2007)

2. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (2004)

3. Berg, A., Berg, T., Malik, J.: Shape matching and object recognition using low distortion correspondences. In: Conference on Computer Vision and Pattern Recognition, San Diego, CA. (2005)

4. Pilet, J., Lepetit, V., Fua, P.: Real-time non-rigid surface detection. In: Conference on Computer Vision and Pattern Recognition, San Diego, CA. (2005)

5. Ozuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code. In: Conference on Computer Vision and Pattern Recognition. (2007)

6. Gavrila, D.M., Philomin, V.: Real-time object detection for "smart" vehicles. In: 7th International Conference on Computer Vision. Volume I. (1999) 87-93

7. Olson, C.F., Huttenlocher, D.P.: Automatic target recognition by matching oriented edge pixels. IEEE Transactions on Image Processing 6 (1997) 103-113

8. Steger, C.: Occlusion, clutter, and illumination invariant object recognition. In Kalliany, R., Leberl, F., eds.: International Archives of Photogrammetry, Remote Sensing, and Spatial Information Sciences. Volume XXXIV, part 3A., Graz (2002) 345-350

9. Ulrich, M., Baumgartner, A., Steger, C.: Automatic hierarchical object decomposition for object recognition. In: International Archives of Photogrammetry and Remote Sensing. Volume XXXIV, part 5. (2002) 99-104

10. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)

11. Benhimane, S., Malis, E.: Homography-based 2D visual tracking and servoing. Special joint issue IJCV/IJRR on robot and vision, published in The International Journal of Robotics Research 26 (2007) 661-676

12. Zimmermann, K., Matas, J., Svoboda, T.: Tracking by an optimal sequence of linear predictors. IEEE Transactions on Pattern Analysis and Machine Intelligence (2008, to appear)