

Speeded-Up Robust Features (SURF)

Herbert Bay a, Andreas Ess a, Tinne Tuytelaars b, and Luc Van Gool a,b

a ETH Zurich, BIWI, Sternwartstrasse 7, CH-8092 Zurich, Switzerland
b K. U. Leuven, ESAT-PSI, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium

Abstract

This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps.

The paper encompasses a detailed description of the detector and descriptor and then explores the effect of the most important parameters. We conclude the article with SURF’s application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF’s usefulness in a broad range of topics in computer vision.

Key words: interest points, local features, feature description, camera calibration, object recognition


1. Introduction

The task of finding point correspondences between two images of the same scene or object is part of many computer vision applications. Image registration, camera calibration, object recognition, and image retrieval are just a few.

The search for discrete image point correspondences can be divided into three main steps. First, ‘interest points’ are selected at distinctive locations in the image, such as corners, blobs, and T-junctions. The most valuable property of an interest point detector is its repeatability. The repeatability expresses the reliability of a detector for finding the same physical interest points under different viewing conditions. Next, the neighbourhood of every interest point is represented by a feature vector. This descriptor has to be distinctive and at the same time robust to noise, detection displacements, and geometric and photometric deformations. Finally, the descriptor vectors are matched between different images. The matching is based on a distance between the vectors, e.g. the Mahalanobis or Euclidean distance. The dimension of the descriptor has a direct impact on the time this takes, and fewer dimensions are desirable for fast interest point matching. However, lower-dimensional feature vectors are in general less distinctive than their high-dimensional counterparts.

It has been our goal to develop both a detector and descriptor that, in comparison to the state-of-the-art, are fast to compute while not sacrificing performance. In order to succeed, one has to strike a balance between the above requirements, such as simplifying the detection scheme while keeping it accurate, and reducing the descriptor’s size while keeping it sufficiently distinctive.

A wide variety of detectors and descriptors have already been proposed in the literature (e.g. [21,24,27,37,39,25]). Also, detailed comparisons and evaluations on benchmarking datasets have been performed [28,30,31]. Our fast detector and descriptor, called SURF (Speeded-Up Robust Features), was introduced in [4]. It is built on the insights gained from this previous work. In our experiments on these benchmarking datasets, SURF’s detector and descriptor are not only faster, but the former is also more repeatable and the latter more distinctive.

Preprint submitted to Elsevier 10 September 2008

We focus on scale and in-plane rotation-invariant detectors and descriptors. These seem to offer a good compromise between feature complexity and robustness to commonly occurring deformations. Skew, anisotropic scaling, and perspective effects are assumed to be second-order effects that are covered to some degree by the overall robustness of the descriptor. Note that the descriptor can be extended towards affine-invariant regions using affine normalisation of the ellipse (cf. [31]), although this will have an impact on the computation time. Extending the detector, on the other hand, is less straightforward. Concerning the photometric deformations, we assume a simple linear model with a bias (offset) and contrast change (scale factor). Neither detector nor descriptor use colour information.

In section 3, we describe the strategy applied for fast and robust interest point detection. The input image is analysed at different scales in order to guarantee invariance to scale changes. The detected interest points are provided with a rotation and scale invariant descriptor in section 4. Furthermore, a simple and efficient first-line indexing technique, based on the contrast of the interest point with its surrounding, is proposed.

In section 5, some of the available parameters and their effects are discussed, including the benefits of an upright version (not invariant to image rotation). We also investigate SURF’s performance in two important application scenarios. First, we consider a special case of image registration, namely the problem of camera calibration for 3D reconstruction. Second, we will explore SURF’s application to an object recognition experiment. Both applications highlight SURF’s benefits in terms of speed and robustness as opposed to other strategies. The article is concluded in section 6.

2. Related Work

2.1. Interest Point Detection

The most widely used detector is probably the Harris corner detector [15], proposed back in 1988. It is based on the eigenvalues of the second moment matrix. However, Harris corners are not scale-invariant. Lindeberg [21] introduced the concept of automatic scale selection. This makes it possible to detect interest points in an image, each with their own characteristic scale. He experimented with both the determinant of the Hessian matrix as well as the Laplacian (which corresponds to the trace of the Hessian matrix) to detect blob-like structures. Mikolajczyk and Schmid [26] refined this method, creating robust and scale-invariant feature detectors with high repeatability, which they coined Harris-Laplace and Hessian-Laplace. They used a (scale-adapted) Harris measure or the determinant of the Hessian matrix to select the location, and the Laplacian to select the scale. Focusing on speed, Lowe [23] proposed to approximate the Laplacian of Gaussians (LoG) by a Difference of Gaussians (DoG) filter.

Several other scale-invariant interest point detectors have been proposed. Examples are the salient region detector, proposed by Kadir and Brady [17], which maximises the entropy within the region, and the edge-based region detector proposed by Jurie and Schmid [16]. They seem less amenable to acceleration though. Also several affine-invariant feature detectors have been proposed that can cope with wider viewpoint changes. However, these fall outside the scope of this article.

From studying the existing detectors and from published comparisons [29,30], we can conclude that Hessian-based detectors are more stable and repeatable than their Harris-based counterparts. Moreover, using the determinant of the Hessian matrix rather than its trace (the Laplacian) seems advantageous, as it fires less on elongated, ill-localised structures. We also observed that approximations like the DoG can bring speed at a low cost in terms of lost accuracy.

2.2. Interest Point Description

An even larger variety of feature descriptors has been proposed, like Gaussian derivatives [11], moment invariants [32], complex features [1,36], steerable filters [12], phase-based local features [6], and descriptors representing the distribution of smaller-scale features within the interest point neighbourhood. The latter, introduced by Lowe [24], have been shown to outperform the others [28]. This can be explained by the fact that they capture a substantial amount of information about the spatial intensity patterns, while at the same time being robust to small deformations or localisation errors. The descriptor in [24], called SIFT for short, computes a histogram of local oriented gradients around the interest point and stores the bins in a 128-dimensional vector (8 orientation bins for each of 4×4 location bins).

Various refinements on this basic scheme have been proposed. Ke and Sukthankar [18] applied PCA on the gradient image around the detected interest point. This PCA-SIFT yields a 36-dimensional descriptor which is fast for matching, but proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk [30]; and applying PCA slows down feature computation. In the same paper [30], the authors proposed a variant of SIFT, called GLOH, which proved to be even more distinctive with the same number of dimensions. However, GLOH is computationally more expensive as it uses again PCA for data compression.

The SIFT descriptor still seems the most appealing descriptor for practical uses, and hence also the most widely used nowadays. It is distinctive and relatively fast, which is crucial for on-line applications. Recently, Se et al. [37] implemented SIFT on a Field Programmable Gate Array (FPGA) and improved its speed by an order of magnitude. Meanwhile, Grabner et al. [14] also used integral images to approximate SIFT. Their detection step is based on difference-of-mean (without interpolation), their description step on integral histograms. They achieve about the same speed as we do (though the description step is constant in speed), but at the cost of reduced quality compared to SIFT. Generally, the high dimensionality of the descriptor is a drawback of SIFT at the matching step. For on-line applications relying only on a regular PC, each one of the three steps (detection, description, matching) has to be fast.

An entire body of work is available on speeding up the matching step. All of these methods come at the expense of an approximate matching. They include the best-bin-first strategy proposed by Lowe [24], balltrees [35], vocabulary trees [34], locality sensitive hashing [9], and redundant bit vectors [13]. Complementary to this, we suggest the use of the Hessian matrix’s trace to significantly increase the matching speed. Together with the descriptor’s low dimensionality, any matching algorithm is bound to perform faster.
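This sign-based first-line index (compare only features whose Laplacian sign agrees, i.e. dark blobs on light backgrounds versus the reverse; the criterion is detailed in section 4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, array layout, and the nearest-neighbour ratio test are our own choices.

```python
import numpy as np

def match_with_sign_index(desc1, signs1, desc2, signs2, ratio=0.7):
    """Nearest-neighbour matching that only compares descriptors whose
    Laplacian signs agree. desc*: (N, 64)-style arrays of unit-length
    descriptors; signs*: (N,) arrays with entries in {-1, +1}."""
    matches = []
    for i, (d, s) in enumerate(zip(desc1, signs1)):
        # First-line index: skip all candidates with the opposite contrast sign.
        cand = np.flatnonzero(signs2 == s)
        if len(cand) < 2:
            continue
        dists = np.linalg.norm(desc2[cand] - d, axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        # Accept only clearly best matches (a standard ratio test, our addition).
        if best < ratio * second:
            matches.append((i, cand[order[0]]))
    return matches
```

In the best case (signs split evenly), half of the candidate comparisons are skipped, which is the factor-of-two speed-up mentioned in section 4.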

3. Interest Point Detection

Our approach for interest point detection uses a very basic Hessian-matrix approximation. This lends itself to the use of integral images as made popular by Viola and Jones [41], which reduces the computation time drastically. Integral images fit in the more general framework of boxlets, as proposed by Simard et al. [38].

3.1. Integral Images

In order to make the article more self-contained, we briefly discuss the concept of integral images. They allow for fast computation of box type convolution filters. The entry of an integral image I_\Sigma(\mathbf{x}) at a location \mathbf{x} = (x, y)^\top represents the sum of all pixels in the input image I within a rectangular region formed by the origin and \mathbf{x}:

I_\Sigma(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j).  (1)

Once the integral image has been computed, it takes three additions to calculate the sum of the intensities over any upright, rectangular area (see figure 1). Hence, the calculation time is independent of its size. This is important in our approach, as we use big filter sizes.
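As a concrete sketch of equation (1) and the constant-time box sum, assuming a NumPy-style row/column array layout (function names are ours):

```python
import numpy as np

def integral_image(img):
    """I_Sigma(x, y): sum of all pixels in the rectangle spanned by the
    origin and (x, y), per eq. (1). Computed with two cumulative sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from the integral image ii.
    Three additions/subtractions and four memory accesses, so the cost
    is independent of the box size (cf. figure 1)."""
    A = ii[r0 - 1, c0 - 1] if r0 > 0 and c0 > 0 else 0
    B = ii[r0 - 1, c1] if r0 > 0 else 0
    C = ii[r1, c0 - 1] if c0 > 0 else 0
    return ii[r1, c1] - B - C + A
```

The boundary checks simply treat entries outside the image as zero; a common alternative is to pad the integral image with a zero row and column.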

3.2. Hessian Matrix Based Interest Points

We base our detector on the Hessian matrix because of its good performance in accuracy. More precisely, we detect blob-like structures at locations where the determinant is maximum. In contrast to the Hessian-Laplace detector by Mikolajczyk and Schmid [26], we rely on the determinant

Fig. 1. Using integral images, it takes only three additions and four memory accesses to calculate the sum of intensities inside a rectangular region of any size.

of the Hessian also for the scale selection, as done by Lindeberg [21].

Given a point \mathbf{x} = (x, y) in an image I, the Hessian matrix H(\mathbf{x}, \sigma) in \mathbf{x} at scale \sigma is defined as follows

H(\mathbf{x}, \sigma) = \begin{pmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{pmatrix},  (2)

where L_{xx}(\mathbf{x}, \sigma) is the convolution of the Gaussian second order derivative \frac{\partial^2}{\partial x^2} g(\sigma) with the image I in point \mathbf{x}, and similarly for L_{xy}(\mathbf{x}, \sigma) and L_{yy}(\mathbf{x}, \sigma).

Gaussians are optimal for scale-space analysis [19,20], but in practice they have to be discretised and cropped (figure 2, left half). This leads to a loss in repeatability under image rotations around odd multiples of π/4. This weakness holds for Hessian-based detectors in general. Figure 3 shows the repeatability rate of two detectors based on the Hessian matrix for pure image rotation. The repeatability attains a maximum around multiples of π/2. This is due to the square shape of the filter. Nevertheless, the detectors still perform well, and the slight decrease in performance does not outweigh the advantage of fast convolutions brought by the discretisation and cropping. As real filters are non-ideal in any case, and given Lowe’s success with his LoG approximations, we push the approximation for the Hessian matrix even further with box filters (in the right half of figure 2). These approximate second order Gaussian derivatives and can be evaluated at a very low computational cost using integral images. The calculation time therefore is independent of the filter size. As shown in the results section and figure 3, the performance is comparable or better than with the discretised and cropped Gaussians.

Fig. 2. Left to right: the (discretised and cropped) Gaussian second order partial derivative in y- (Lyy) and xy-direction (Lxy), respectively; our approximation for the second order Gaussian partial derivative in y- (Dyy) and xy-direction (Dxy). The grey regions are equal to zero.


The 9×9 box filters in figure 2 are approximations of a Gaussian with σ = 1.2 and represent the lowest scale (i.e. highest spatial resolution) for computing the blob response maps. We will denote them by Dxx, Dyy, and Dxy. The weights applied to the rectangular regions are kept simple for computational efficiency. This yields

\det(H_{\mathrm{approx}}) = D_{xx} D_{yy} - (w D_{xy})^2.  (3)

The relative weight w of the filter responses is used to balance the expression for the Hessian’s determinant. This is needed for the energy conservation between the Gaussian kernels and the approximated Gaussian kernels,

w = \frac{|L_{xy}(1.2)|_F \, |D_{yy}(9)|_F}{|L_{yy}(1.2)|_F \, |D_{xy}(9)|_F} = 0.912... \simeq 0.9,  (4)

where |x|_F is the Frobenius norm. Notice that for theoretical correctness, the weighting changes depending on the scale. In practice, we keep this factor constant, as this did not have a significant impact on the results in our experiments.

Furthermore, the filter responses are normalised with respect to their size. This guarantees a constant Frobenius norm for any filter size, an important aspect for the scale space analysis as discussed in the next section.

The approximated determinant of the Hessian represents the blob response in the image at location x. These responses are stored in a blob response map over different scales, and local maxima are detected as explained in section 3.4.
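A minimal sketch of evaluating eq. (3) with the constant weight w ≈ 0.9 and the area normalisation just described, assuming the box-filter responses Dxx, Dyy, Dxy have already been computed from the integral image (function and argument names are ours):

```python
import numpy as np

def blob_response(Dxx, Dyy, Dxy, filter_size, w=0.9):
    """Approximated det(H) per eq. (3). The raw box-filter responses are
    normalised by the filter area so that the Frobenius norm stays constant
    across filter sizes (see section 3.2)."""
    area = float(filter_size) ** 2
    dxx, dyy, dxy = Dxx / area, Dyy / area, Dxy / area
    return dxx * dyy - (w * dxy) ** 2
```

The result is one layer of the blob response map; stacking the layers for all filter sizes gives the input to the non-maximum suppression of section 3.4.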

Fig. 3. Top: Repeatability score (%) for image rotation of up to 180 degrees (curves: Fast-Hessian, Hessian-Laplace). Hessian-based detectors have in general a lower repeatability score for angles around uneven multiples of π/4. Bottom: Sample images from the Van Gogh sequence that was used. Fast-Hessian is the more accurate version of our detector (FH-15), as explained in section 3.3.

3.3. Scale Space Representation

Interest points need to be found at different scales, not least because the search of correspondences often requires their comparison in images where they are seen at different scales. Scale spaces are usually implemented as an image pyramid. The images are repeatedly smoothed with a Gaussian and then sub-sampled in order to achieve a higher level of the pyramid. Lowe [24] subtracts these pyramid layers in order to get the DoG (Difference of Gaussians) images where edges and blobs can be found.

Due to the use of box filters and integral images, we do not have to iteratively apply the same filter to the output of a previously filtered layer, but instead can apply box filters of any size at exactly the same speed directly on the original image and even in parallel (although the latter is not exploited here). Therefore, the scale space is analysed by up-scaling the filter size rather than iteratively reducing the image size, figure 4. The output of the 9×9 filter, introduced in the previous section, is considered as the initial scale layer, to which we will refer as scale s = 1.2 (approximating Gaussian derivatives with σ = 1.2). The following layers are obtained by filtering the image with gradually bigger masks, taking into account the discrete nature of integral images and the specific structure of our filters.

Note that our main motivation for this type of sampling is its computational efficiency. Furthermore, as we do not have to downsample the image, there is no aliasing. On the downside, box filters preserve high-frequency components that can get lost in zoomed-out variants of the same scene, which can limit scale-invariance. This was however not noticeable in our experiments.

Fig. 4. Instead of iteratively reducing the image size (left), the use of integral images allows the up-scaling of the filter at constant cost (right).

The scale space is divided into octaves. An octave represents a series of filter response maps obtained by convolving the same input image with a filter of increasing size. In total, an octave encompasses a scaling factor of 2 (which implies that one needs to more than double the filter size, see below). Each octave is subdivided into a constant number of scale levels. Due to the discrete nature of integral images, the minimum scale difference between 2 subsequent scales depends on the length l0 of the positive or negative lobes of the partial second order derivative in the direction of derivation (x or y), which is set to a third of the filter size length. For the 9×9 filter, this length l0 is 3. For two successive levels, we must increase this size by a minimum of 2 pixels (one pixel on every side) in order to keep the size uneven and thus ensure the presence of the central pixel. This results in a total increase of the mask size by 6 pixels (see figure 5). Note that for dimensions different from l0 (e.g. the width of the central band for the vertical filter in figure 5), rescaling the mask introduces rounding-off errors. However, since these errors are typically much smaller than l0, this is an acceptable approximation.

Fig. 5. Filters Dyy (top) and Dxy (bottom) for two successive scale levels (9×9 and 15×15). The length of the dark lobe can only be increased by an even number of pixels in order to guarantee the presence of a central pixel (top).

The construction of the scale space starts with the 9×9 filter, which calculates the blob response of the image for the smallest scale. Then, filters with sizes 15×15, 21×21, and 27×27 are applied, by which even more than a scale change of 2 has been achieved. But this is needed, as a 3D non-maximum suppression is applied both spatially and over the neighbouring scales. Hence, the first and last Hessian response maps in the stack cannot contain such maxima themselves, as they are used for reasons of comparison only. Therefore, after interpolation (see section 3.4), the smallest possible scale is σ = 1.2 · (12/9) = 1.6, corresponding to a filter size of 12×12, and the highest to σ = 1.2 · (24/9) = 3.2. For more details, we refer to [2].

Similar considerations hold for the other octaves. For each new octave, the filter size increase is doubled (going from 6 to 12 to 24 to 48). At the same time, the sampling intervals for the extraction of the interest points can be doubled as well for every new octave. This reduces the computation time and the loss in accuracy is comparable to the image sub-sampling of the traditional approaches. The filter sizes for the second octave are 15, 27, 39, 51. A third octave is computed with the filter sizes 27, 51, 75, 99 and, if the original image size is still larger than the corresponding filter sizes, the scale space analysis is performed for a fourth octave, using the filter sizes 51, 99, 147, and 195. Figure 6 gives an overview of the filter sizes for the first three octaves. Note that more octaves may be analysed, but the number of detected interest points per octave decays very quickly, cf. figure 7.
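The filter-size progression described above (step 6, 12, 24, 48 pixels; each octave starting at the second filter size of the previous one, so that octaves overlap) can be generated as follows. This is our reconstruction of the stated pattern, not code from the paper:

```python
def octave_filter_sizes(n_octaves=4, levels=4, base=9, base_step=6):
    """Filter side lengths per octave: the step starts at 6 px and doubles
    for each new octave; each octave starts at the second filter size of
    the previous octave, so the octaves overlap."""
    octaves, start, step = [], base, base_step
    for _ in range(n_octaves):
        sizes = [start + i * step for i in range(levels)]
        octaves.append(sizes)
        start = sizes[1]   # overlap: next octave begins at the 2nd filter
        step *= 2
    return octaves
```

Running it reproduces the sizes listed in the text: 9–27 for the first octave, 15–51 for the second, 27–99 for the third, and 51–195 for the fourth.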

The large scale changes, especially between the first filters within these octaves (from 9 to 15 is a change of 1.7),

Fig. 6. Graphical representation of the filter side lengths for three different octaves. The logarithmic horizontal axis represents the scales. Note that the octaves are overlapping in order to cover all possible scales seamlessly.

renders the sampling of scales quite crude. Therefore, we have also implemented a scale space with a finer sampling of the scales. This first doubles the size of the image, using linear interpolation, and then starts the first octave by filtering with a filter of size 15. Additional filter sizes are 21, 27, 33, and 39. Then a second octave starts, again using filters which now increase their sizes by 12 pixels, after which a third and fourth octave follow. Now the scale change between the first two filters is only 1.4 (21/15). The lowest scale for the accurate version that can be detected through quadratic interpolation is s = (1.2 · (18/9))/2 = 1.2.

As the Frobenius norm remains constant for our filters at any size, they are already scale normalised, and no further weighting of the filter response is required, see [22].

3.4. Interest Point Localisation

In order to localise interest points in the image and over scales, a non-maximum suppression in a 3×3×3 neighbourhood is applied. Specifically, we use a fast variant introduced by Neubeck and Van Gool [33]. The maxima of the determinant of the Hessian matrix are then interpolated in scale and image space with the method proposed by Brown et al. [5].
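A naive sketch of the 3×3×3 criterion follows; the paper uses the much faster Neubeck–Van Gool scheme, so this brute-force version only illustrates what counts as a maximum (names and the threshold parameter are ours):

```python
import numpy as np

def nms_3x3x3(responses, threshold=0.0):
    """Brute-force 3x3x3 non-maximum suppression over a stack of blob
    response maps with shape (scales, rows, cols). A point is kept if it
    strictly dominates its 26 neighbours in space and scale."""
    S, R, C = responses.shape
    maxima = []
    for s in range(1, S - 1):          # first/last maps are comparison-only
        for r in range(1, R - 1):
            for c in range(1, C - 1):
                v = responses[s, r, c]
                if v <= threshold:
                    continue
                neigh = responses[s-1:s+2, r-1:r+2, c-1:c+2]
                # maximum of the 3x3x3 block, attained only at the centre
                if v >= neigh.max() and (neigh == v).sum() == 1:
                    maxima.append((s, r, c))
    return maxima
```

The retained (scale, row, col) triples would then be refined by the quadratic interpolation of Brown et al. mentioned above.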

Scale space interpolation is especially important in our case, as the difference in scale between the first layers of every octave is relatively large. Figure 8 shows an example of the detected interest points using our ‘Fast-Hessian’ detector.

Fig. 7. Histogram of the detected scales. The number of detected interest points per octave decays quickly.

Fig. 8. Detected interest points for a Sunflower field. This kind of scene shows the nature of the features obtained using Hessian-based detectors.

4. Interest Point Description and Matching

Our descriptor describes the distribution of the intensity content within the interest point neighbourhood, similar to the gradient information extracted by SIFT [24] and its variants. We build on the distribution of first order Haar wavelet responses in x and y direction rather than the gradient, exploit integral images for speed, and use only 64 dimensions. This reduces the time for feature computation and matching, and has proven to simultaneously increase the robustness. Furthermore, we present a new indexing step based on the sign of the Laplacian, which increases not only the robustness of the descriptor, but also the matching speed (by a factor of two in the best case). We refer to our detector-descriptor scheme as SURF – Speeded-Up Robust Features.

The first step consists of fixing a reproducible orientation based on information from a circular region around the interest point. Then, we construct a square region aligned to the selected orientation and extract the SURF descriptor from it. Finally, features are matched between two images. These three steps are explained in the following.

4.1. Orientation Assignment

In order to be invariant to image rotation, we identify a reproducible orientation for the interest points. For that purpose, we first calculate the Haar wavelet responses in x and y direction within a circular neighbourhood of radius 6s around the interest point, with s the scale at which the interest point was detected. The sampling step is scale dependent and chosen to be s. In keeping with the rest, also the size of the wavelets is scale dependent and set to a side length of 4s. Therefore, we can again use integral images for fast filtering. The used filters are shown in figure 9. Only six operations are needed to compute the response in x or y direction at any scale.

Fig. 9. Haar wavelet filters to compute the responses in x (left) and y direction (right). The dark parts have the weight −1 and the light parts +1.

Once the wavelet responses are calculated and weighted with a Gaussian (σ = 2s) centred at the interest point, the responses are represented as points in a space with the horizontal response strength along the abscissa and the vertical response strength along the ordinate. The dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window of size π/3, see figure 10. The horizontal and vertical responses within the window are summed. The two summed responses then yield a local orientation vector. The longest such vector over all windows defines the orientation of the interest point. The size of the sliding window is a parameter which had to be chosen carefully. Small sizes fire on single dominating gradients, large sizes tend to yield maxima in vector length that are not outspoken. Both result in a misorientation of the interest point.
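The sliding-window search can be sketched as follows, assuming the Gaussian-weighted responses dx, dy are given as flat arrays. The window discretisation n_steps is our assumption; the paper does not specify how finely the window positions are sampled:

```python
import numpy as np

def dominant_orientation(dx, dy, window=np.pi / 3, n_steps=42):
    """Dominant orientation from weighted Haar responses (dx, dy): slide a
    pi/3 window over the response angles, sum the responses falling inside,
    and keep the orientation of the longest resulting vector."""
    angles = np.arctan2(dy, dx)
    best_len, best_ori = -1.0, 0.0
    for start in np.linspace(-np.pi, np.pi, n_steps, endpoint=False):
        # responses whose angle lies in [start, start + window), with wrap-around
        diff = (angles - start) % (2 * np.pi)
        inside = diff < window
        vx, vy = dx[inside].sum(), dy[inside].sum()
        length = np.hypot(vx, vy)
        if length > best_len:
            best_len, best_ori = length, np.arctan2(vy, vx)
    return best_ori
```

As the text notes, the window size matters: shrinking `window` makes the estimate latch onto single dominant gradients, while enlarging it flattens the maxima.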

Fig. 10. Orientation assignment: A sliding orientation window of size π/3 detects the dominant orientation of the Gaussian weighted Haar wavelet responses at every sample point within a circular neighbourhood around the interest point.

Note that for many applications, rotation invariance is not necessary. Experiments of using the upright version of SURF (U-SURF, for short) for object detection can be found in [3,4]. U-SURF is faster to compute and can increase distinctivity, while maintaining a robustness to rotation of about ±15°.

4.2. Descriptor based on Sum of Haar Wavelet Responses

For the extraction of the descriptor, the first step consists of constructing a square region centred around the interest point and oriented along the orientation selected in the previous section. The size of this window is 20s. Examples of such square regions are illustrated in figure 11.

Fig. 11. Detail of the Graffiti scene showing the size of the oriented descriptor window at different scales.

The region is split up regularly into smaller 4×4 square sub-regions. This preserves important spatial information. For each sub-region, we compute Haar wavelet responses at 5×5 regularly spaced sample points. For reasons of simplicity, we call dx the Haar wavelet response in horizontal direction and dy the Haar wavelet response in vertical direction (filter size 2s), see figure 9 again. “Horizontal” and “vertical” here is defined in relation to the selected interest point orientation (see figure 12).¹ To increase the robustness towards geometric deformations and localisation errors, the responses dx and dy are first weighted with a Gaussian (σ = 3.3s) centred at the interest point.

Fig. 12. To build the descriptor, an oriented quadratic grid with 4×4 square sub-regions is laid over the interest point (left). For each square, the wavelet responses are computed. The 2×2 sub-divisions of each square correspond to the actual fields of the descriptor. These are the sums dx, |dx|, dy, and |dy|, computed relative to the orientation of the grid (right).

Then, the wavelet responses dx and dy are summed up over each sub-region and form a first set of entries in the feature vector. In order to bring in information about the polarity of the intensity changes, we also extract the sum of the absolute values of the responses, |dx| and |dy|. Hence, each sub-region has a four-dimensional descriptor vector v for its underlying intensity structure, v = (∑dx, ∑dy, ∑|dx|, ∑|dy|). Concatenating this for all 4×4 sub-regions results in a descriptor vector of length 64. The wavelet responses are invariant to a bias in illumination (offset). Invariance to contrast (a scale factor) is achieved by turning the descriptor into a unit vector.

1 For efficiency reasons, the Haar wavelets are calculated in the unrotated image and the responses are then interpolated, instead of actually rotating the image.
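The assembly of the 64-dimensional vector can be sketched as follows. This is a simplified sketch, not the authors' code: it assumes the Gaussian-weighted, orientation-aligned responses have already been sampled on a 20×20 grid (4×4 sub-regions of 5×5 samples each), and it omits the interpolation mentioned in the footnote.

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Build a SURF-like 64-dim descriptor from 20x20 grids of Haar responses.
    Each of the 4x4 sub-regions contributes (sum dx, sum dy, sum|dx|, sum|dy|)."""
    assert dx.shape == dy.shape == (20, 20)
    entries = []
    for i in range(4):
        for j in range(4):
            bx = dx[5 * i:5 * (i + 1), 5 * j:5 * (j + 1)]
            by = dy[5 * i:5 * (i + 1), 5 * j:5 * (j + 1)]
            entries += [bx.sum(), by.sum(), np.abs(bx).sum(), np.abs(by).sum()]
    v = np.array(entries)
    # Contrast invariance: normalise to a unit vector.
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

The final normalisation is what provides the invariance to a contrast scale factor mentioned in the text.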

Fig. 13. The descriptor entries of a sub-region represent the nature of the underlying intensity pattern. Left: In case of a homogeneous region, all values are relatively low. Middle: In presence of frequencies in x direction, the value of ∑|dx| is high, but all others remain low. If the intensity is gradually increasing in x direction, both values ∑dx and ∑|dx| are high.

Figure 13 shows the properties of the descriptor for three distinctively different image intensity patterns within a sub-region. One can imagine combinations of such local intensity patterns, resulting in a distinctive descriptor.

SURF is, up to some point, similar in concept to SIFT, in that both focus on the spatial distribution of gradient information. Nevertheless, SURF outperforms SIFT in practically all cases, as shown in Section 5. We believe this is due to the fact that SURF integrates the gradient information within a subpatch, whereas SIFT depends on the orientations of the individual gradients. This makes SURF less sensitive to noise, as illustrated in the example of Figure 14.

Fig. 14. Due to the global integration of SURF's descriptor, it stays more robust to various image perturbations than the more locally operating SIFT descriptor.

In order to arrive at these SURF descriptors, we experimented with fewer and more wavelet features, second-order derivatives, higher-order wavelets, PCA, median values, average values, etc. From a thorough evaluation, the proposed sets turned out to perform best. We then varied the number of sample points and sub-regions. The 4×4 sub-region division provided the best results, see also section 5. Finer subdivisions appeared to be less robust and would increase matching times too much. On the other hand, the short descriptor with 3×3 sub-regions (SURF-36) performs slightly worse, but allows for very fast matching and is still acceptable in comparison to other descriptors in the literature.

We also tested an alternative version of the SURF descriptor that adds a couple of similar features (SURF-128). It again uses the same sums as before, but now splits these values up further. The sums of dx and |dx| are computed separately for dy < 0 and dy ≥ 0. Similarly, the sums of dy and |dy| are split up according to the sign of dx, thereby doubling the number of features. The descriptor is more distinctive and not much slower to compute, but slower to match due to its higher dimensionality.
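The sign-based splitting for one sub-region can be sketched as below. This is a sketch of the idea, not the authors' implementation; the eight entries per sub-region, concatenated over all 16 sub-regions, give the 128 dimensions.

```python
import numpy as np

def extended_entries(dx, dy):
    """SURF-128-style entries for one sub-region: sums of dx and |dx| are
    computed separately for dy < 0 and dy >= 0, and vice versa for dy."""
    dx = np.asarray(dx, float).ravel()
    dy = np.asarray(dy, float).ravel()
    neg, pos = dy < 0, dy >= 0
    out = [dx[pos].sum(), np.abs(dx[pos]).sum(),
           dx[neg].sum(), np.abs(dx[neg]).sum()]
    neg, pos = dx < 0, dx >= 0
    out += [dy[pos].sum(), np.abs(dy[pos]).sum(),
            dy[neg].sum(), np.abs(dy[neg]).sum()]
    return out  # 8 entries per sub-region -> 16 * 8 = 128 overall
```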

4.3. Fast Indexing for Matching

For fast indexing during the matching stage, the sign of the Laplacian (i.e. the trace of the Hessian matrix) for the underlying interest point is included. Typically, the interest points are found at blob-type structures. The sign of the Laplacian distinguishes bright blobs on dark backgrounds from the reverse situation. This feature is available at no extra computational cost, as it was already computed during the detection phase. In the matching stage, we only compare features if they have the same type of contrast, see figure 15. Hence, this minimal information allows for faster matching, without reducing the descriptor's performance. Note that this is also of advantage for more advanced indexing methods. E.g., for k-d trees, this extra information defines a meaningful hyperplane for splitting the data, as opposed to randomly choosing an element or using feature statistics.

Fig. 15. If the contrast between two interest points is different (dark on light background vs. light on dark background), the candidate is not considered a valuable match.
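The sign-based pre-filter can be sketched as below. The interface (parallel lists of descriptors and Laplacian signs) is a hypothetical choice for illustration; the point is simply that candidates with the opposite contrast type are never compared.

```python
import numpy as np

def sign_filtered_nn(descs1, signs1, descs2, signs2):
    """Nearest neighbour in descriptor space, comparing only features whose
    Laplacian sign agrees (dark-on-light vs. light-on-dark)."""
    matches = []
    for i, (d, s) in enumerate(zip(descs1, signs1)):
        # Restrict the search to candidates with the same contrast type.
        cand = [j for j, t in enumerate(signs2) if t == s]
        if not cand:
            matches.append((i, None))
            continue
        j_best = min(cand, key=lambda j: np.linalg.norm(d - descs2[j]))
        matches.append((i, j_best))
    return matches
```

On average, half of the candidates are discarded before any descriptor distance is computed, which is where the speed-up comes from.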

5. Results

The following presents both simulated and real-world results. First, we evaluate the effect of some parameter settings and show the overall performance of our detector and descriptor based on a standard evaluation set.

detector          threshold   nb of points   comp. time (ms)
FH-15             60000       1813           160
FH-9              50000       1411           70
Hessian-Laplace   1000        1979           700
Harris-Laplace    2500        1664           2100
DoG               default     1520           400

Table 1. Thresholds, number of detected points and calculation time for the detectors in our comparison. (First image of the Graffiti scene, 800×640.)

Then, we describe two possible applications. For a detailed comparative study with other detectors and descriptors, we refer to [4]. SURF has already been tested in a few real-world applications. For object detection, its performance has been illustrated in [3]. Cattin et al. [7] use SURF for mosaicing human retina images, a task that no other detector/descriptor scheme was able to cope with. Taking this application to image registration a bit further, we focus in this article on the more difficult problem of camera calibration and 3D reconstruction, also in wide-baseline cases. SURF manages to calibrate the cameras reliably and accurately even in challenging cases. Lastly, we investigate the application of SURF to the task of object recognition.

5.1. Experimental Evaluation and Parameter Settings

We tested our detector using the image sequences and testing software provided by Mikolajczyk 2. The evaluation criterion is the repeatability score.

The test sequences comprise images of real textured and structured scenes. There are different types of geometric and photometric transformations, like changing viewpoints, zoom and rotation, image blur, lighting changes and JPEG compression.

In all experiments reported in this paper, the timings were measured on a standard PC Pentium IV, running at 3 GHz.

5.1.1. SURF Detector

We tested two versions of our Fast-Hessian detector, depending on the initial Gaussian derivative filter size. FH-9 stands for our Fast-Hessian detector with an initial filter size of 9×9, and FH-15 for the 15×15 filter on the doubled input image size. Apart from this, for all experiments shown in this section, the same thresholds and parameters were used.

The detector is compared to the Difference of Gaussians (DoG) detector by Lowe [24], and the Harris- and Hessian-Laplace detectors proposed by Mikolajczyk [29]. The number of interest points found is on average very similar for all detectors (see table 1 for an example). The thresholds were adapted according to the number of interest points found with the DoG detector.

2 http://www.robots.ox.ac.uk/~vgg/research/affine/


The FH-9 detector is more than five times faster than DoG and ten times faster than Hessian-Laplace. The FH-15 detector is more than three times faster than DoG and more than four times faster than Hessian-Laplace (see also table 1). At the same time, the repeatability scores for our detectors are comparable to, or even better than, those of the competitors.

The repeatability scores for the Graffiti sequence (figure 16 top) are comparable for all detectors. The repeatability score of the FH-15 detector for the Wall sequence (figure 16 bottom) outperforms the competitors. Note that the sequences Graffiti and Wall contain out-of-plane rotation, resulting in affine deformations, while the detectors in the comparison are only invariant to image rotation and scale. Hence, these deformations have to be accounted for by the overall robustness of the features. In the Boat sequence (figure 17 top), the FH-15 detector again shows better performance than the others. The FH-9 and FH-15 detectors outperform the others in the Bikes sequence (figure 17 bottom). The superiority and accuracy of our detector is further underlined in sections 5.2 and 5.3.

[Two line plots: repeatability (%) vs. viewpoint angle (20–60°), with curves for FH-15, FH-9, DoG, Harris-Laplace and Hessian-Laplace.]

Fig. 16. Repeatability score for the Graffiti (top) and Wall (bottom) sequence (viewpoint change).

[Two line plots: repeatability (%) vs. scale change (1–3) and vs. increasing blur (2–6), with the same detector curves as in figure 16.]

Fig. 17. Repeatability score for the Boat (top) and Bikes (bottom) sequence (scale change, image blur).

5.1.2. SURF Descriptor

Here, we focus on two options offered by the SURF descriptor and their effect on recall/precision. Firstly, the number of divisions of the square grid in figure 12, and hence the descriptor size, has a major impact on the matching speed. Secondly, we consider the extended descriptor described above. Figure 18 plots recall and precision against the side length of the square grid, for both the standard and the extended descriptor. Only the number of divisions is varied, not the actual size of the parent square. SURF-36 refers to a grid of 3×3, SURF-72 to its extended counterpart. Likewise, SURF-100 refers to 5×5 and SURF-144 to 6×6, with SURF-200 and SURF-288 their extended versions. To get averaged numbers over multiple images (we chose one pair from each set of test images), the ratio-matching scheme [24] is used.

Clearly, a grid of size 4×4 performs best with respect to both recall and precision in both cases. Still, 3×3 is a viable alternative as well, especially when matching speed is important. From further analysis, we discovered that the extended descriptor loses with respect to recall, but exhibits better precision. Overall, the effect of the extended version is minimal.
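The ratio-matching scheme of [24] used above accepts a match only when the nearest neighbour is clearly closer than the second-nearest. A minimal sketch follows; the threshold value of 0.8 is a commonly used choice, an assumption here rather than a value taken from this paper.

```python
import numpy as np

def ratio_match(descs1, descs2, threshold=0.8):
    """Nearest-neighbour ratio matching: keep (i, j) only when the best
    distance is clearly smaller than the second-best one."""
    matches = []
    descs2 = np.asarray(descs2, float)
    for i, d in enumerate(np.asarray(descs1, float)):
        dists = np.linalg.norm(descs2 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        # Ambiguous matches (two nearly equidistant candidates) are dropped.
        if dists[j1] < threshold * dists[j2]:
            matches.append((i, int(j1)))
    return matches
```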


[Two recall vs. 1−precision plots. Top: SURF-16, SURF-36, SURF-64, SURF-100, SURF-144. Bottom: SURF-32, SURF-72, SURF-128, SURF-200, SURF-288.]

Fig. 18. Recall-precision for nearest-neighbour ratio matching for varying side length of the square grid. A maximum is attained for a grid of 4×4. Figures averaged over 8 image pairs of Mikolajczyk's database. Top: standard descriptor, bottom: extended descriptor.

Extensive comparisons with other descriptors can be found in [4]. Here, we only show a comparison with two other prominent description schemes (SIFT [24] and GLOH [30]), again averaged over the test sequences (Figure 19). SURF-64 turns out to perform best.

Another major advantage of SURF is its low computation time: detection and description of 1529 interest points takes about 610 ms; the upright version U-SURF needs a mere 400 ms (first Graffiti image; Pentium 4, 3 GHz).

5.2. Application to 3D

In this section, we evaluate the accuracy of our Fast-Hessian detector for the application of camera self-calibration and 3D reconstruction. The first evaluation compares different state-of-the-art interest point detectors for the two-view case. A known scene is used to provide some quantitative results. The second evaluation considers the N-view case for camera self-calibration and dense 3D reconstruction from multiple images, some taken under wide-baseline conditions.

[Recall vs. 1−precision plot with curves for SURF-64, SIFT and GLOH.]

Fig. 19. Recall-precision for nearest-neighbour ratio matching for different description schemes, evaluated on SURF keypoints. Figures averaged over 8 image pairs of Mikolajczyk's database.

5.2.1. 2-view Case

In order to evaluate the performance of different interest point detection schemes for camera calibration and 3D reconstruction, we created a controlled environment. A good scene for such an evaluation is two highly textured planes forming a right angle (measured 88.6° in our case), see figure 20. The images are of size 800×600. The principal point and aspect ratio are known. As the number of correct matches is an important factor for the accuracy, we adjusted the interest point detectors' parameters so that after matching, we are left with 800 correct matches (matches not belonging to the angle are filtered). The SURF-128 descriptor was used for the matching step. The location of the two planes was estimated using RANSAC, followed by orthogonal regression. The evaluation criteria are the angle between the two planes, as well as the mean distance and the variance of the reconstructed 3D points to their respective planes for the different interest point detectors.
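The angle criterion above can be computed by fitting each plane to its reconstructed points by orthogonal regression and comparing the normals. The sketch below shows only this last step (the RANSAC inlier selection is omitted); it is an illustration of the criterion, not the authors' evaluation code.

```python
import numpy as np

def plane_angle_deg(points_a, points_b):
    """Fit a plane to each 3D point cloud by orthogonal regression (SVD) and
    return the angle between the two plane normals, in degrees."""
    def normal(points):
        centred = points - points.mean(axis=0)
        # The right singular vector with the smallest singular value is the
        # direction of least variance, i.e. the plane normal.
        return np.linalg.svd(centred)[2][-1]
    n_a, n_b = normal(points_a), normal(points_b)
    cos = abs(np.dot(n_a, n_b))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))
```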

Table 2 shows these quantitative results for our two versions of the Fast-Hessian detector (FH-9 and FH-15), the DoG features of SIFT [24], and the Hessian- and Harris-Laplace detectors proposed by Mikolajczyk and Schmid [29]. The FH-15 detector clearly outperforms its competitors.

Figure 21 shows the orthogonal projection of the Fast-Hessian (FH-15) features for the reconstructed angle. Interestingly, the theoretically better-founded approaches like the Harris- and Hessian-Laplace detectors perform worse than the approximations (DoG and the SURF features).

5.2.2. N-view Case

The SURF detection and description algorithms have been integrated with the Epoch 3D Webservice of the VISICS research group at the K.U. Leuven 3. This webservice allows users to upload sequences of still images

3 http://homes.esat.kuleuven.be/~visit3d/webservice/html


Fig. 20. Input images for the quantitative detector evaluation. This represents a good scene choice for the comparison of different types of interest point detectors, as its components are simple geometric elements.

Fig. 21. Orthogonal projection of the reconstructed angle shown in figure 20.

to a server. There, the calibration of the cameras and dense depth maps are computed automatically using these images only [40]. During the camera calibration stage, features need to be extracted and matched between the images. The use of SURF features improved the results of this step for many uploaded image sets, especially when the images were taken further apart. The previous procedure, using Harris corners and normalised cross-correlation of image windows, has problems matching such wide-baseline images. Furthermore, the DoG detector combined with SIFT description failed on some image sequences where SURF succeeded in calibrating all the cameras accurately.

For the example in figure 22, the traditional approach managed to calibrate only 6 of a total of 13 cameras. Using SURF, however, all 13 cameras could be calibrated. The vase is easily recognisable even in the sparse 3D model.

Figure 23 shows a typical wide-baseline problem: three

detector          angle   mean dist   std-dev
FH-15             88.5°   1.14 px     1.23 px
FH-9              88.4°   1.64 px     1.78 px
DoG               88.9°   1.95 px     2.14 px
Harris-Laplace    88.3°   2.13 px     2.33 px
Hessian-Laplace   91.1°   2.85 px     3.13 px

Table 2. Comparison of different interest point detectors for the application of camera calibration and 3D reconstruction. The true angle is 88.6°.

Fig. 22. 3D reconstruction with KU-Leuven's 3D webservice. Left: One of the 13 input images for the camera calibration. Right: Position of the reconstructed cameras and sparse 3D model of the vase.

images, taken from different, widely separated viewpoints. It is a challenging example, as three images represent the absolute minimum needed for an accurate dense 3D reconstruction. The obtained 3D model can be seen in figure 23 (bottom). In general, the quality of the camera calibration can best be appreciated through the quality of such resulting dense models. These experiments confirm the suitability of the SURF detector/descriptor pair for applications in image registration, camera calibration, and 3D reconstruction, where the accuracy of the correspondences is vital.

5.3. Application to Object Recognition

Bay et al. [3] already demonstrated the usefulness of SURF in a simple object detection task. To further illustrate the quality of the descriptor in such a scenario, we present some further experiments. The basis for this was a publicly available implementation of two bag-of-words classifiers [10]. 4 Given an image, the task is to identify whether an object occurs in the image or not. For our comparison, we considered the naive Bayes classifier, which works directly on the bag-of-words representation, as suggested by Dance et al. [8]. This simple classifier was chosen as more complicated methods like pLSA might wash out the actual effect of the descriptor. Similar to [10], we executed our tests on 400 images each from the Caltech background and airplanes sets 5. 50% of the images are used for training, the other 50% for testing. To minimise the influence of the partitioning, the same random permutation of training and test sets was chosen for all descriptors. While this is a rather simple test set for object recognition in general, it definitely serves the purpose of comparing the performance of the actual descriptors.

The framework already provides interest points, chosen randomly along Canny edges to create a very dense sampling. These are then fed to the various descriptors. Additionally, we also consider the use of SURF keypoints, generated with a very low threshold to ensure good coverage.
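A multinomial naive Bayes classifier over bag-of-visual-words histograms, in the spirit of the one used here, can be sketched as follows. This is an illustrative sketch, not the implementation of [10]: vocabulary construction (quantising descriptors into visual words) is assumed to have happened already, and the Laplace smoothing parameter is a conventional choice.

```python
import numpy as np

class BowNaiveBayes:
    """Minimal multinomial naive Bayes over bag-of-words histograms."""
    def fit(self, histograms, labels, alpha=1.0):
        X, y = np.asarray(histograms, float), np.asarray(labels)
        self.classes = np.unique(y)
        counts = np.array([X[y == c].sum(axis=0) for c in self.classes])
        # Laplace-smoothed per-class word probabilities, kept in log space.
        self.log_p = np.log((counts + alpha) /
                            (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]))
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        return self

    def predict(self, histogram):
        scores = self.log_prior + self.log_p @ np.asarray(histogram, float)
        return self.classes[np.argmax(scores)]
```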

Figure 24 shows the obtained ROC curves for SURF-128, SIFT and GLOH. Note that for the calculation of SURF, the sign of the Laplacian was removed from the descriptor. For both types of interest points, SURF-128 outperforms

4 http://people.csail.mit.edu/fergus/iccv2005/bagwords.html
5 http://www.vision.caltech.edu/html-files/archive.html


Fig. 23. 3D reconstruction with KU-Leuven's 3D webservice. Top row: The 3 input images of a detail of the San Marco Cathedral in Venice. Middle row: Samples of the textured dense reconstruction. Bottom row: Un-textured dense reconstruction. The quality of the dense 3D model directly reflects the quality of the camera calibration. (The images were taken by Maurizio Forte, CNR-ITABC, Rome.)

its competitors significantly over the majority of the curve. Figure 25 investigates the effect of the index size and the extended descriptor of SURF. As can be seen, the upright counterparts of both SURF-128 and SURF-64 perform best. This makes sense, as basically all the images in the database were taken in an upright position. The other alternatives perform only slightly worse, and are comparable. Even SURF-36 exhibits similar discriminative power, and provides a speed-up for the various parts of the recognition system due to its small descriptor size.

The same tests were also carried out for the Caltech motorcycles (side) and faces datasets, yielding similar results.

In conclusion, SURF proves to work well for classification tasks, performing better than the competition on the test sets while still being faster to compute. These positive results indicate that SURF should be very well suited for tasks in object detection, object recognition, or image retrieval.

[Two ROC plots, P(detection) vs. P(false alarm), each with curves for SURF-128, SIFT and GLOH.]

Fig. 24. Comparison of different descriptor strategies for a naive Bayes classifier working on a bag-of-words representation. Top: descriptors evaluated on random edge pixels. Bottom: on SURF keypoints.

6. Conclusion and Outlook

We presented a fast and performant scale- and rotation-invariant interest point detector and descriptor. The important speed gain is due to the use of integral images, which drastically reduce the number of operations for simple box convolutions, independent of the chosen scale. The results showed that the performance of our Hessian approximation is comparable to, and sometimes even better than, that of state-of-the-art interest point detectors. The high repeatability is advantageous for camera self-calibration, where accurate interest point detection has a direct impact on the accuracy of the camera self-calibration and therefore on the quality of the resulting 3D model.
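The integral-image trick underlying this speed gain can be sketched in a few lines: after a single cumulative-sum pass, the sum over any axis-aligned box costs at most four array lookups, independent of the box size. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Cumulative-sum table: entry (r, c) holds the sum of all pixels
    above and to the left of (r, c), inclusive."""
    return np.cumsum(np.cumsum(np.asarray(img, float), axis=0), axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from the integral image ii, using at most
    four lookups regardless of the box size."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total
```

Box filters of any scale therefore cost the same, which is what makes the scale-space construction of the detector cheap.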

The most important improvement, however, is the speed of the detector. Even without any dedicated optimisations, almost real-time computation without loss in performance is possible, which represents an important advantage for many on-line computer vision applications.

Our descriptor, based on sums of Haar wavelet components, outperforms the state-of-the-art methods. It seems


[Two ROC plots, P(detection) vs. P(false alarm). Top: U-SURF-64, SURF-36, SURF-64. Bottom: U-SURF-128, SURF-72, SURF-128.]

Fig. 25. Comparison of different options for the SURF descriptor for a naive Bayes classifier working on a bag-of-words representation. The descriptor was evaluated on SURF keypoints. Top: standard, bottom: extended descriptor.

that the description of the nature of the underlying image-intensity pattern is more distinctive than histogram-based approaches. The simplicity and, again, the use of integral images make our descriptor competitive in terms of speed. Moreover, the Laplacian-based indexing strategy makes the matching step faster without any loss in performance.

Experiments on camera calibration and object recognition highlighted SURF's potential in a wide range of computer vision applications. In the former, the accuracy of the interest points and the distinctiveness of the descriptor proved to be major factors for obtaining a more accurate 3D reconstruction, or even obtaining any 3D reconstruction at all in difficult cases. In the latter, the descriptor generalises well enough to outperform its competitors in a simple object recognition task as well.

The latest version of SURF is available for public down-load. 6

6 http://www.vision.ee.ethz.ch/~surf/

Acknowledgments

The authors gratefully acknowledge the support of Toyota Motor Europe and Toyota Motor Corporation, the Swiss SNF NCCR project IM2, and the Flemish Fund for Scientific Research. We thank Geert Willems for identifying a bug in the interpolation used in the descriptor, Stefan Saur for porting SURF to Windows, Maarten Vergauwen for providing us with the 3D reconstructions, and Dr. Bastian Leibe for numerous inspiring discussions. The DoG detector was kindly provided by David Lowe.

References

[1] A. Baumberg. Reliable feature matching across widely separated views. In CVPR, pages 774–781, 2000.
[2] H. Bay. From Wide-baseline Point and Line Correspondences to 3D. PhD thesis, ETH Zurich, 2006.
[3] H. Bay, B. Fasel, and L. Van Gool. Interactive museum guide: Fast and robust recognition of museum objects. In Proceedings of the First International Workshop on Mobile Vision, May 2006.
[4] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
[5] M. Brown and D. Lowe. Invariant features from interest point groups. In BMVC, 2002.
[6] G. Carneiro and A. D. Jepson. Multi-scale phase-based local features. In CVPR (1), pages 736–743, 2003.
[7] P. C. Cattin, H. Bay, L. Van Gool, and G. Szekely. Retina mosaicing using local features. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), October 2006. In press.
[8] C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka. Visual categorization with bags of keypoints. In ECCV International Workshop on Statistical Learning in Computer Vision, Prague, 2004.
[9] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, pages 253–262, 2004.
[10] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR (2), pages 264–271, 2003.
[11] L. M. J. Florack, B. M. ter Haar Romeny, J. J. Koenderink, and M. A. Viergever. General intensity transformations and differential invariants. JMIV, 4(2):171–187, 1994.
[12] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. PAMI, 13(9):891–906, 1991.
[13] J. Goldstein, J. C. Platt, and C. J. C. Burges. Redundant bit vectors for quickly searching high-dimensional regions. In Deterministic and Statistical Methods in Machine Learning, pages 137–158, 2004.
[14] M. Grabner, H. Grabner, and H. Bischof. Fast approximated SIFT. In ACCV (1), pages 918–927, 2006.
[15] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, pages 147–151, 1988.
[16] F. Jurie and C. Schmid. Scale-invariant shape features for recognition of object categories. In CVPR, volume II, pages 90–96, 2004.
[17] T. Kadir and M. Brady. Scale, saliency and image description. IJCV, 45(2):83–105, 2001.
[18] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In CVPR (2), pages 506–513, 2004.
[19] J. J. Koenderink. The structure of images. Biological Cybernetics, 50:363–370, 1984.
[20] T. Lindeberg. Scale-space for discrete signals. PAMI, 12(3):234–254, 1990.
[21] T. Lindeberg. Feature detection with automatic scale selection. IJCV, 30(2):79–116, 1998.
[22] T. Lindeberg and L. Bretzner. Real-time scale selection in hybrid multi-scale representations. In Scale-Space, pages 148–163, 2003.
[23] D. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[24] D. Lowe. Distinctive image features from scale-invariant keypoints, cascade filtering approach. IJCV, 60(2):91–110, January 2004.
[25] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, pages 384–393, 2002.
[26] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In ICCV, volume 1, pages 525–531, 2001.
[27] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In ECCV, pages 128–142, 2002.
[28] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In CVPR, volume 2, pages 257–263, June 2003.
[29] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63–86, 2004.
[30] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[31] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1/2):43–72, 2005.
[32] F. Mindru, T. Tuytelaars, L. Van Gool, and T. Moons. Moment invariants for recognition under changing viewpoint and illumination. CVIU, 94(1-3):3–27, 2004.
[33] A. Neubeck and L. Van Gool. Efficient non-maximum suppression. In ICPR, 2006.
[34] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR (2), pages 2161–2168, 2006.
[35] S. M. Omohundro. Five balltree construction algorithms. Technical Report TR-89-063, ICSI, 1989.
[36] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". In ECCV, volume 1, pages 414–431, 2002.
[37] S. Se, H. K. Ng, P. Jasiobedzki, and T. J. Moyung. Vision based modeling and localization for planetary exploration rovers. In Proceedings of the International Astronautical Congress, 2004.
[38] P. Simard, L. Bottou, P. Haffner, and Y. LeCun. Boxlets: A fast convolution algorithm for signal processing and neural networks. In NIPS, 1998.
[39] T. Tuytelaars and L. Van Gool. Wide baseline stereo based on local, affinely invariant regions. In BMVC, pages 412–422, 2000.
[40] M. Vergauwen and L. Van Gool. Web-based 3D reconstruction service. Machine Vision and Applications, special issue, to appear.
[41] P. A. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), pages 511–518, 2001.
