


International Journal of Computer Vision 60(2), 91–110, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Distinctive Image Features from Scale-Invariant Keypoints

DAVID G. LOWE
Computer Science Department, University of British Columbia, Vancouver, B.C., Canada

[email protected]

Received January 10, 2003; Revised January 7, 2004; Accepted January 22, 2004

Abstract. This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

Keywords: invariant features, object recognition, scale invariance, image matching

1. Introduction

Image matching is a fundamental aspect of many problems in computer vision, including object or scene recognition, solving for 3D structure from multiple images, stereo correspondence, and motion tracking. This paper describes image features that have many properties that make them suitable for matching differing images of an object or scene. The features are invariant to image scaling and rotation, and partially invariant to change in illumination and 3D camera viewpoint. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. Large numbers of features can be extracted from typical images with efficient algorithms. In addition, the features are highly distinctive, which allows a single feature to be correctly matched with high probability against a large database of features, providing a basis for object and scene recognition.

The cost of extracting these features is minimized by taking a cascade filtering approach, in which the more expensive operations are applied only at locations that pass an initial test. Following are the major stages of computation used to generate the set of image features:

1. Scale-space extrema detection: The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.

2. Keypoint localization: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.

3. Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.


4. Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

This approach has been named the Scale Invariant Feature Transform (SIFT), as it transforms image data into scale-invariant coordinates relative to local features.
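As an aside not found in the original paper: the full pipeline above is available in common libraries, and a minimal usage sketch (assuming OpenCV ≥ 4.4 installed as opencv-python, with a placeholder image filename) looks like this.

```python
import cv2

# Load a test image in grayscale; the filename is a placeholder assumption.
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Stages 1-4: scale-space extrema detection, keypoint localization,
# orientation assignment, and descriptor computation.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries a location, scale, and orientation; each descriptor
# is a 128-element vector (see Section 6).
print(len(keypoints), descriptors.shape)
```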

An important aspect of this approach is that it generates large numbers of features that densely cover the image over the full range of scales and locations. A typical image of size 500 × 500 pixels will give rise to about 2000 stable features (although this number depends on both image content and choices for various parameters). The quantity of features is particularly important for object recognition, where the ability to detect small objects in cluttered backgrounds requires that at least 3 features be correctly matched from each object for reliable identification.

For image matching and recognition, SIFT features are first extracted from a set of reference images and stored in a database. A new image is matched by individually comparing each feature from the new image to this previous database and finding candidate matching features based on Euclidean distance of their feature vectors. This paper will discuss fast nearest-neighbor algorithms that can perform this computation rapidly against large databases.

The keypoint descriptors are highly distinctive, which allows a single feature to find its correct match with good probability in a large database of features. However, in a cluttered image, many features from the background will not have any correct match in the database, giving rise to many false matches in addition to the correct ones. The correct matches can be filtered from the full set of matches by identifying subsets of keypoints that agree on the object and its location, scale, and orientation in the new image. The probability that several features will agree on these parameters by chance is much lower than the probability that any individual feature match will be in error. The determination of these consistent clusters can be performed rapidly by using an efficient hash table implementation of the generalized Hough transform.

Each cluster of 3 or more features that agree on an object and its pose is then subject to further detailed verification. First, a least-squares estimate is made for an affine approximation to the object pose. Any other image features consistent with this pose are identified, and outliers are discarded. Finally, a detailed computation is made of the probability that a particular set of features indicates the presence of an object, given the accuracy of fit and number of probable false matches. Object matches that pass all these tests can be identified as correct with high confidence.

2. Related Research

The development of image matching by using a set of local interest points can be traced back to the work of Moravec (1981) on stereo matching using a corner detector. The Moravec detector was improved by Harris and Stephens (1988) to make it more repeatable under small image variations and near edges. Harris also showed its value for efficient motion tracking and 3D structure from motion recovery (Harris, 1992), and the Harris corner detector has since been widely used for many other image matching tasks. While these feature detectors are usually called corner detectors, they are not selecting just corners, but rather any image location that has large gradients in all directions at a predetermined scale.

The initial applications were to stereo and short-range motion tracking, but the approach was later extended to more difficult problems. Zhang et al. (1995) showed that it was possible to match Harris corners over a large image range by using a correlation window around each corner to select likely matches. Outliers were then removed by solving for a fundamental matrix describing the geometric constraints between the two views of a rigid scene and removing matches that did not agree with the majority solution. At the same time, a similar approach was developed by Torr (1995) for long-range motion matching, in which geometric constraints were used to remove outliers for rigid objects moving within an image.

The ground-breaking work of Schmid and Mohr (1997) showed that invariant local feature matching could be extended to general image recognition problems in which a feature was matched against a large database of images. They also used Harris corners to select interest points, but rather than matching with a correlation window, they used a rotationally invariant descriptor of the local image region. This allowed features to be matched under arbitrary orientation change between the two images. Furthermore, they demonstrated that multiple feature matches could accomplish general recognition under occlusion and clutter by identifying consistent clusters of matched features.



The Harris corner detector is very sensitive to changes in image scale, so it does not provide a good basis for matching images of different sizes. Earlier work by the author (Lowe, 1999) extended the local feature approach to achieve scale invariance. This work also described a new local descriptor that provided more distinctive features while being less sensitive to local image distortions such as 3D viewpoint change. This current paper provides a more in-depth development and analysis of this earlier work, while also presenting a number of improvements in stability and feature invariance.

There is a considerable body of previous research on identifying representations that are stable under scale change. Some of the first work in this area was by Crowley and Parker (1984), who developed a representation that identified peaks and ridges in scale space and linked these into a tree structure. The tree structure could then be matched between images with arbitrary scale change. More recent work on graph-based matching by Shokoufandeh et al. (1999) provides more distinctive feature descriptors using wavelet coefficients. The problem of identifying an appropriate and consistent scale for feature detection has been studied in depth by Lindeberg (1993, 1994). He describes this as a problem of scale selection, and we make use of his results below.

Recently, there has been an impressive body of work on extending local features to be invariant to full affine transformations (Baumberg, 2000; Tuytelaars and Van Gool, 2000; Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002; Brown and Lowe, 2002). This allows for invariant matching to features on a planar surface under changes in orthographic 3D projection, in most cases by resampling the image in a local affine frame. However, none of these approaches are yet fully affine invariant, as they start with initial feature scales and locations selected in a non-affine-invariant manner due to the prohibitive cost of exploring the full affine space. The affine frames are also more sensitive to noise than those of the scale-invariant features, so in practice the affine features have lower repeatability than the scale-invariant features unless the affine distortion is greater than about a 40 degree tilt of a planar surface (Mikolajczyk, 2002). Wider affine invariance may not be important for many applications, as training views are best taken at least every 30 degrees rotation in viewpoint (meaning that recognition is within 15 degrees of the closest training view) in order to capture non-planar changes and occlusion effects for 3D objects.

While the method to be presented in this paper is not fully affine invariant, a different approach is used in which the local descriptor allows relative feature positions to shift significantly with only small changes in the descriptor. This approach not only allows the descriptors to be reliably matched across a considerable range of affine distortion, but it also makes the features more robust against changes in 3D viewpoint for non-planar surfaces. Other advantages include much more efficient feature extraction and the ability to identify larger numbers of features. On the other hand, affine invariance is a valuable property for matching planar surfaces under very large view changes, and further research should be performed on the best ways to combine this with non-planar 3D viewpoint invariance in an efficient and stable manner.

Many other feature types have been proposed for use in recognition, some of which could be used in addition to the features described in this paper to provide further matches under differing circumstances. One class of features are those that make use of image contours or region boundaries, which should make them less likely to be disrupted by cluttered backgrounds near object boundaries. Matas et al. (2002) have shown that their maximally-stable extremal regions can produce large numbers of matching features with good stability. Mikolajczyk et al. (2003) have developed a new descriptor that uses local edges while ignoring unrelated nearby edges, providing the ability to find stable features even near the boundaries of narrow shapes superimposed on background clutter. Nelson and Selinger (1998) have shown good results with local features based on groupings of image contours. Similarly, Pope and Lowe (2000) used features based on the hierarchical grouping of image contours, which are particularly useful for objects lacking detailed texture.

The history of research on visual recognition contains work on a diverse set of other image properties that can be used as feature measurements. Carneiro and Jepson (2002) describe phase-based local features that represent the phase rather than the magnitude of local spatial frequencies, which is likely to provide improved invariance to illumination. Schiele and Crowley (2000) have proposed the use of multidimensional histograms summarizing the distribution of measurements within image regions. This type of feature may be particularly useful for recognition of textured objects with deformable shapes.


Basri and Jacobs (1997) have demonstrated the value of extracting local region boundaries for recognition. Other useful properties to incorporate include color, motion, figure-ground discrimination, region shape descriptors, and stereo depth cues. The local feature approach can easily incorporate novel feature types because extra features contribute to robustness when they provide correct matches, but otherwise do little harm other than their cost of computation. Therefore, future systems are likely to combine many feature types.

3. Detection of Scale-Space Extrema

As described in the introduction, we will detect keypoints using a cascade filtering approach that uses efficient algorithms to identify candidate locations that are then examined in further detail. The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. Detecting locations that are invariant to scale change of the image can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space (Witkin, 1983).

It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of reasonable assumptions the only possible scale-space kernel is the Gaussian function. Therefore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):

L(x, y, σ) = G(x, y, σ) ∗ I(x, y),

where ∗ is the convolution operation in x and y, and

G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²)/(2σ²)).

To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999) using scale-space extrema in the difference-of-Gaussian function convolved with the image, D(x, y, σ), which can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)
           = L(x, y, kσ) − L(x, y, σ).    (1)

There are a number of reasons for choosing this function. First, it is a particularly efficient function to compute, as the smoothed images, L, need to be computed in any case for scale space feature description, and D can therefore be computed by simple image subtraction.

In addition, the difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian of Gaussian, σ²∇²G, as studied by Lindeberg (1994). Lindeberg showed that the normalization of the Laplacian with the factor σ² is required for true scale invariance. In detailed experimental comparisons, Mikolajczyk (2002) found that the maxima and minima of σ²∇²G produce the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.

The relationship between D and σ²∇²G can be understood from the heat diffusion equation (parameterized in terms of σ rather than the more usual t = σ²):

∂G/∂σ = σ∇²G.

From this, we see that ∇²G can be computed from the finite difference approximation to ∂G/∂σ, using the difference of nearby scales at kσ and σ:

σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)

and therefore,

G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)σ²∇²G.

This shows that when the difference-of-Gaussian function has scales differing by a constant factor it already incorporates the σ² scale normalization required for the scale-invariant Laplacian. The factor (k − 1) in the equation is a constant over all scales and therefore does not influence extrema location. The approximation error will go to zero as k goes to 1, but in practice we have found that the approximation has almost no impact on the stability of extrema detection or localization for even significant differences in scale, such as k = √2.
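The relation above can be checked numerically. The following sketch (my illustration, not from the paper; it assumes NumPy and SciPy are available) compares a difference of Gaussians at scales kσ and σ with (k − 1)σ² times a Laplacian-of-Gaussian response on a random image.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

rng = np.random.default_rng(0)
img = rng.random((128, 128))
sigma, k = 1.6, np.sqrt(2)

# Difference of two nearby Gaussian blurs vs. scale-normalized Laplacian.
dog = gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
scaled_log = (k - 1) * sigma ** 2 * gaussian_laplace(img, sigma)

# The two responses are nearly proportional even for k = sqrt(2).
print(np.corrcoef(dog.ravel(), scaled_log.ravel())[0, 1])  # close to 1
```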


Figure 1. For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process repeated.

An efficient approach to construction of D(x, y, σ) is shown in Fig. 1. The initial image is incrementally convolved with Gaussians to produce images separated by a constant factor k in scale space, shown stacked in the left column. We choose to divide each octave of scale space (i.e., doubling of σ) into an integer number, s, of intervals, so k = 2^(1/s). We must produce s + 3 images in the stack of blurred images for each octave, so that final extrema detection covers a complete octave. Adjacent image scales are subtracted to produce the difference-of-Gaussian images shown on the right. Once a complete octave has been processed, we resample the Gaussian image that has twice the initial value of σ (it will be 2 images from the top of the stack) by taking every second pixel in each row and column. The accuracy of sampling relative to σ is no different than for the start of the previous octave, while computation is greatly reduced.
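A minimal sketch of this construction (my code, using the paper's parameters σ = 1.6 and s = 3; `image` is assumed to be a 2D float array already carrying the base blur) might look like the following.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_octave(image, sigma=1.6, s=3):
    """Return s+3 Gaussian images and s+2 DoG images for one octave."""
    k = 2.0 ** (1.0 / s)
    gaussians = [image]
    for i in range(1, s + 3):
        # Incremental blur: going from sigma*k^(i-1) to sigma*k^i requires
        # an extra Gaussian of width sqrt((sigma*k^i)^2 - (sigma*k^(i-1))^2).
        sig_prev = sigma * k ** (i - 1)
        sig_next = sigma * k ** i
        sig_inc = np.sqrt(sig_next ** 2 - sig_prev ** 2)
        gaussians.append(gaussian_filter(gaussians[-1], sig_inc))
    # D is computed by simple image subtraction of adjacent scales (Eq. 1).
    dogs = [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]
    # The next octave starts from the image with twice the initial sigma
    # (2 images from the top of the stack), down-sampled by taking every
    # second pixel in each row and column.
    next_base = gaussians[s][::2, ::2]
    return gaussians, dogs, next_base
```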

3.1. Local Extrema Detection

In order to detect the local maxima and minima of D(x, y, σ), each sample point is compared to its eight neighbors in the current image and nine neighbors in the scale above and below (see Fig. 2). It is selected only if it is larger than all of these neighbors or smaller than all of them. The cost of this check is reasonably low due to the fact that most sample points will be eliminated following the first few checks.
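A direct, unoptimized sketch of this 26-neighbor comparison (assuming `dogs` is the list of same-sized difference-of-Gaussian images from the previous sketch, and (i, x, y) is an interior sample) is shown below.

```python
import numpy as np

def is_extremum(dogs, i, x, y):
    """True if dogs[i][x, y] is strictly larger or smaller than its 26 neighbors."""
    value = dogs[i][x, y]
    # Gather the 3x3 neighborhoods at the current and adjacent scales.
    block = np.stack([d[x - 1:x + 2, y - 1:y + 2] for d in dogs[i - 1:i + 2]])
    neighbors = np.delete(block.ravel(), 13)  # drop the center sample itself
    return value > neighbors.max() or value < neighbors.min()
```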

An important issue is to determine the frequency of sampling in the image and scale domains that is needed to reliably detect the extrema. Unfortunately, it turns out that there is no minimum spacing of samples that will detect all extrema, as the extrema can be arbitrarily close together.

Figure 2. Maxima and minima of the difference-of-Gaussian images are detected by comparing a pixel (marked with X) to its 26 neighbors in 3 × 3 regions at the current and adjacent scales (marked with circles).

This can be seen by considering a white circle on a black background, which will have a single scale space maximum where the circular positive central region of the difference-of-Gaussian function matches the size and location of the circle. For a very elongated ellipse, there will be two maxima near each end of the ellipse. As the locations of maxima are a continuous function of the image, for some ellipse with intermediate elongation there will be a transition from a single maximum to two, with the maxima arbitrarily close to each other near the transition.


Figure 3. The top line of the first graph shows the percent of keypoints that are repeatably detected at the same location and scale in a transformed image as a function of the number of scales sampled per octave. The lower line shows the percent of keypoints that have their descriptors correctly matched to a large database. The second graph shows the total number of keypoints detected in a typical image as a function of the number of scale samples.

Therefore, we must settle for a solution that trades off efficiency with completeness. In fact, as might be expected and is confirmed by our experiments, extrema that are close together are quite unstable to small perturbations of the image. We can determine the best choices experimentally by studying a range of sampling frequencies and using those that provide the most reliable results under a realistic simulation of the matching task.

3.2. Frequency of Sampling in Scale

The experimental determination of sampling frequency that maximizes extrema stability is shown in Figs. 3 and 4. These figures (and most other simulations in this paper) are based on a matching task using a collection of 32 real images drawn from a diverse range, including outdoor scenes, human faces, aerial photographs, and industrial images (the image domain was found to have almost no influence on any of the results). Each image was then subject to a range of transformations, including rotation, scaling, affine stretch, change in brightness and contrast, and addition of image noise. Because the changes were synthetic, it was possible to precisely predict where each feature in an original image should appear in the transformed image, allowing for measurement of correct repeatability and positional accuracy for each feature.

Figure 3 shows these simulation results used to examine the effect of varying the number of scales per octave at which the image function is sampled prior to extrema detection.

Figure 4. The top line in the graph shows the percent of keypoint locations that are repeatably detected in a transformed image as a function of the prior image smoothing for the first level of each octave. The lower line shows the percent of descriptors correctly matched against a large database.

In this case, each image was resampled following rotation by a random angle and scaling by a random amount between 0.2 and 0.9 times the original size. Keypoints from the reduced resolution image were matched against those from the original image so that the scales for all keypoints would be present in the matched image. In addition, 1% image noise was added, meaning that each pixel had a random number added from the uniform interval [−0.01, 0.01] where pixel values are in the range [0, 1] (equivalent to providing slightly less than 6 bits of accuracy for image pixels).


The top line in the first graph of Fig. 3 shows the percent of keypoints that are detected at a matching location and scale in the transformed image. For all examples in this paper, we define a matching scale as being within a factor of √2 of the correct scale, and a matching location as being within σ pixels, where σ is the scale of the keypoint (defined from Eq. (1) as the standard deviation of the smallest Gaussian used in the difference-of-Gaussian function). The lower line on this graph shows the number of keypoints that are correctly matched to a database of 40,000 keypoints using the nearest-neighbor matching procedure to be described in Section 6 (this shows that once the keypoint is repeatably located, it is likely to be useful for recognition and matching tasks). As this graph shows, the highest repeatability is obtained when sampling 3 scales per octave, and this is the number of scale samples used for all other experiments throughout this paper.

It might seem surprising that the repeatability does not continue to improve as more scales are sampled. The reason is that this results in many more local extrema being detected, but these extrema are on average less stable and therefore are less likely to be detected in the transformed image. This is shown by the second graph in Fig. 3, which shows the average number of keypoints detected and correctly matched in each image. The number of keypoints rises with increased sampling of scales and the total number of correct matches also rises. Since the success of object recognition often depends more on the quantity of correctly matched keypoints, as opposed to their percentage correct matching, for many applications it will be optimal to use a larger number of scale samples. However, the cost of computation also rises with this number, so for the experiments in this paper we have chosen to use just 3 scale samples per octave.

To summarize, these experiments show that the scale-space difference-of-Gaussian function has a large number of extrema and that it would be very expensive to detect them all. Fortunately, we can detect the most stable and useful subset even with a coarse sampling of scales.

3.3. Frequency of Sampling in the Spatial Domain

Just as we determined the frequency of sampling per octave of scale space, so we must determine the frequency of sampling in the image domain relative to the scale of smoothing. Given that extrema can be arbitrarily close together, there will be a similar trade-off between sampling frequency and rate of detection. Figure 4 shows an experimental determination of the amount of prior smoothing, σ, that is applied to each image level before building the scale space representation for an octave. Again, the top line is the repeatability of keypoint detection, and the results show that the repeatability continues to increase with σ. However, there is a cost to using a large σ in terms of efficiency, so we have chosen to use σ = 1.6, which provides close to optimal repeatability. This value is used throughout this paper and was used for the results in Fig. 3.

Of course, if we pre-smooth the image before extrema detection, we are effectively discarding the highest spatial frequencies. Therefore, to make full use of the input, the image can be expanded to create more sample points than were present in the original. We double the size of the input image using linear interpolation prior to building the first level of the pyramid. While the equivalent operation could effectively have been performed by using sets of subpixel-offset filters on the original image, the image doubling leads to a more efficient implementation. We assume that the original image has a blur of at least σ = 0.5 (the minimum needed to prevent significant aliasing), and that therefore the doubled image has σ = 1.0 relative to its new pixel spacing. This means that little additional smoothing is needed prior to creation of the first octave of scale space. The image doubling increases the number of stable keypoints by almost a factor of 4, but no significant further improvements were found with a larger expansion factor.
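For concreteness, the blur bookkeeping described above amounts to the following small calculation (my arithmetic under the paper's stated assumptions): Gaussian blur combines in quadrature, so the doubled image needs only modest additional smoothing to reach the base level σ = 1.6.

```python
import numpy as np

assumed_blur = 0.5               # blur assumed for the original image
doubled_blur = 2 * assumed_blur  # = 1.0 relative to the new pixel spacing
base_sigma = 1.6                 # target blur for the first level of each octave

# Additional Gaussian smoothing actually applied to the doubled image.
sigma_extra = np.sqrt(base_sigma ** 2 - doubled_blur ** 2)
print(sigma_extra)               # ~1.25
```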

4. Accurate Keypoint Localization

Once a keypoint candidate has been found by comparing a pixel to its neighbors, the next step is to perform a detailed fit to the nearby data for location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge.

The initial implementation of this approach (Lowe, 1999) simply located keypoints at the location and scale of the central sample point. However, recently Brown has developed a method (Brown and Lowe, 2002) for fitting a 3D quadratic function to the local sample points to determine the interpolated location of the maximum, and his experiments showed that this provides a substantial improvement to matching and stability.


His approach uses the Taylor expansion (up to the quadratic terms) of the scale-space function, D(x, y, σ), shifted so that the origin is at the sample point:

D(x) = D + (∂D/∂x)ᵀ x + (1/2) xᵀ (∂²D/∂x²) x    (2)

where D and its derivatives are evaluated at the sample point and x = (x, y, σ)ᵀ is the offset from this point. The location of the extremum, x̂, is determined by taking the derivative of this function with respect to x and setting it to zero, giving

x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x).    (3)

As suggested by Brown, the Hessian and derivative of D are approximated by using differences of neighboring sample points. The resulting 3 × 3 linear system can be solved with minimal cost. If the offset x̂ is larger than 0.5 in any dimension, then it means that the extremum lies closer to a different sample point.

Figure 5. This figure shows the stages of keypoint selection. (a) The 233 × 189 pixel original image. (b) The initial 832 keypoint locations at maxima and minima of the difference-of-Gaussian function. Keypoints are displayed as vectors indicating scale, orientation, and location. (c) After applying a threshold on minimum contrast, 729 keypoints remain. (d) The final 536 keypoints that remain following an additional threshold on ratio of principal curvatures.

In this case, the sample point is changed and the interpolation performed instead about that point. The final offset x̂ is added to the location of its sample point to get the interpolated estimate for the location of the extremum.

The function value at the extremum, D(x̂), is useful for rejecting unstable extrema with low contrast. This can be obtained by substituting Eq. (3) into (2), giving

D(x̂) = D + (1/2) (∂D/∂x)ᵀ x̂.

For the experiments in this paper, all extrema with a value of D(x̂) less than 0.03 were discarded (as before, we assume image pixel values in the range [0, 1]).
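The localization step and the contrast test can be sketched as follows (my code, not the author's; the finite-difference stencils are one standard choice for "differences of neighboring sample points", and the handling of large offsets is simplified to a rejection rather than re-interpolating about the neighboring sample).

```python
import numpy as np

def refine_keypoint(dogs, i, x, y, contrast_threshold=0.03):
    """Return (offset, value) for the extremum near sample (i, x, y), or None."""
    d = dogs  # list of 2D DoG images, pixel values assumed in [0, 1]
    # First derivatives of D with respect to (x, y, sigma), by central differences.
    grad = 0.5 * np.array([
        d[i][x + 1, y] - d[i][x - 1, y],
        d[i][x, y + 1] - d[i][x, y - 1],
        d[i + 1][x, y] - d[i - 1][x, y],
    ])
    # Hessian entries from differences of neighboring sample points.
    dxx = d[i][x + 1, y] - 2 * d[i][x, y] + d[i][x - 1, y]
    dyy = d[i][x, y + 1] - 2 * d[i][x, y] + d[i][x, y - 1]
    dss = d[i + 1][x, y] - 2 * d[i][x, y] + d[i - 1][x, y]
    dxy = 0.25 * (d[i][x + 1, y + 1] - d[i][x + 1, y - 1]
                  - d[i][x - 1, y + 1] + d[i][x - 1, y - 1])
    dxs = 0.25 * (d[i + 1][x + 1, y] - d[i + 1][x - 1, y]
                  - d[i - 1][x + 1, y] + d[i - 1][x - 1, y])
    dys = 0.25 * (d[i + 1][x, y + 1] - d[i + 1][x, y - 1]
                  - d[i - 1][x, y + 1] + d[i - 1][x, y - 1])
    hess = np.array([[dxx, dxy, dxs], [dxy, dyy, dys], [dxs, dys, dss]])

    offset = -np.linalg.solve(hess, grad)          # Eq. (3)
    value = d[i][x, y] + 0.5 * grad.dot(offset)    # D(x_hat)
    if np.abs(offset).max() > 0.5 or abs(value) < contrast_threshold:
        return None  # extremum belongs to a neighboring sample, or too weak
    return offset, value
```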

Figure 5 shows the effects of keypoint selection on a natural image. In order to avoid too much clutter, a low-resolution 233 by 189 pixel image is used and keypoints are shown as vectors giving the location, scale, and orientation of each keypoint (orientation assignment is described below). Figure 5(a) shows the original image, which is shown at reduced contrast behind the subsequent figures.


Figure 5(b) shows the 832 keypoints at all detected maxima and minima of the difference-of-Gaussian function, while (c) shows the 729 keypoints that remain following removal of those with a value of D(x̂) less than 0.03. Part (d) will be explained in the following section.

4.1. Eliminating Edge Responses

For stability, it is not sufficient to reject keypoints with low contrast. The difference-of-Gaussian function will have a strong response along edges, even if the location along the edge is poorly determined and therefore unstable to small amounts of noise.

A poorly defined peak in the difference-of-Gaussian function will have a large principal curvature across the edge but a small one in the perpendicular direction. The principal curvatures can be computed from a 2 × 2 Hessian matrix, H, computed at the location and scale of the keypoint:

H = [ Dxx  Dxy ]
    [ Dxy  Dyy ]    (4)

The derivatives are estimated by taking differences of neighboring sample points.

The eigenvalues of H are proportional to the principal curvatures of D. Borrowing from the approach used by Harris and Stephens (1988), we can avoid explicitly computing the eigenvalues, as we are only concerned with their ratio. Let α be the eigenvalue with the largest magnitude and β be the smaller one. Then, we can compute the sum of the eigenvalues from the trace of H and their product from the determinant:

Tr(H) = Dxx + Dyy = α + β,
Det(H) = Dxx Dyy − (Dxy)² = αβ.

In the unlikely event that the determinant is negative, the curvatures have different signs so the point is discarded as not being an extremum. Let r be the ratio between the largest magnitude eigenvalue and the smaller one, so that α = rβ. Then,

Tr(H)² / Det(H) = (α + β)² / (αβ) = (rβ + β)² / (rβ²) = (r + 1)² / r,

which depends only on the ratio of the eigenvalues rather than their individual values. The quantity (r + 1)²/r is at a minimum when the two eigenvalues are equal and it increases with r. Therefore, to check that the ratio of principal curvatures is below some threshold, r, we only need to check

Tr(H)² / Det(H) < (r + 1)² / r.

This is very efficient to compute, with less than 20 floating point operations required to test each keypoint. The experiments in this paper use a value of r = 10, which eliminates keypoints that have a ratio between the principal curvatures greater than 10. The transition from Fig. 5(c) to (d) shows the effects of this operation.
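A sketch of this test on a single difference-of-Gaussian image `dog` (my code, using the same finite-difference stencils as in the earlier sketch) follows.

```python
import numpy as np

def passes_edge_test(dog, x, y, r=10.0):
    """True if the ratio of principal curvatures at (x, y) is below r."""
    dxx = dog[x + 1, y] - 2 * dog[x, y] + dog[x - 1, y]
    dyy = dog[x, y + 1] - 2 * dog[x, y] + dog[x, y - 1]
    dxy = 0.25 * (dog[x + 1, y + 1] - dog[x + 1, y - 1]
                  - dog[x - 1, y + 1] + dog[x - 1, y - 1])
    tr = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:
        return False  # curvatures have different signs: not an extremum
    return tr ** 2 / det < (r + 1) ** 2 / r
```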

5. Orientation Assignment

By assigning a consistent orientation to each keypoint based on local image properties, the keypoint descriptor can be represented relative to this orientation and therefore achieve invariance to image rotation. This approach contrasts with the orientation invariant descriptors of Schmid and Mohr (1997), in which each image property is based on a rotationally invariant measure. The disadvantage of that approach is that it limits the descriptors that can be used and discards image information by not requiring all measures to be based on a consistent rotation.

Following experimentation with a number of approaches to assigning a local orientation, the following approach was found to give the most stable results. The scale of the keypoint is used to select the Gaussian smoothed image, L, with the closest scale, so that all computations are performed in a scale-invariant manner. For each image sample, L(x, y), at this scale, the gradient magnitude, m(x, y), and orientation, θ(x, y), is precomputed using pixel differences:

m(x, y) = √((L(x + 1, y) − L(x − 1, y))² + (L(x, y + 1) − L(x, y − 1))²)

θ(x, y) = tan⁻¹((L(x, y + 1) − L(x, y − 1)) / (L(x + 1, y) − L(x − 1, y)))
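A vectorized sketch of this precomputation (my code; border pixels are simply left at zero) is given below, producing the magnitude and orientation arrays used in the next step.

```python
import numpy as np

def gradients(L):
    """Pixel-difference gradient magnitude and orientation (degrees) for image L."""
    dx = np.zeros_like(L)
    dy = np.zeros_like(L)
    dx[1:-1, :] = L[2:, :] - L[:-2, :]   # L(x+1, y) - L(x-1, y)
    dy[:, 1:-1] = L[:, 2:] - L[:, :-2]   # L(x, y+1) - L(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    orientation = np.degrees(np.arctan2(dy, dx)) % 360.0  # in [0, 360)
    return magnitude, orientation
```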


An orientation histogram is formed from the gradient orientations of sample points within a region around the keypoint. The orientation histogram has 36 bins covering the 360 degree range of orientations. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times that of the scale of the keypoint.

Peaks in the orientation histogram correspond to dominant directions of local gradients. The highest peak in the histogram is detected, and then any other local peak that is within 80% of the highest peak is used to also create a keypoint with that orientation. Therefore, for locations with multiple peaks of similar magnitude, there will be multiple keypoints created at the same location and scale but different orientations. Only about 15% of points are assigned multiple orientations, but these contribute significantly to the stability of matching. Finally, a parabola is fit to the 3 histogram values closest to each peak to interpolate the peak position for better accuracy.
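Putting the pieces above together, a simplified sketch of orientation assignment (my code; the 3σ sampling radius is an assumption, and `magnitude` and `orientation` are the arrays from the earlier gradient sketch, with orientation in degrees) could be:

```python
import numpy as np

def assign_orientations(magnitude, orientation, x, y, scale, num_bins=36):
    """Return the keypoint orientations (degrees) for the sample at (x, y)."""
    sigma = 1.5 * scale
    radius = int(round(3 * sigma))          # assumed sampling radius
    hist = np.zeros(num_bins)
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            if not (0 <= px < magnitude.shape[0] and 0 <= py < magnitude.shape[1]):
                continue
            # Gaussian-weighted circular window times gradient magnitude.
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2 * sigma ** 2))
            bin_idx = int(orientation[px, py] / 360.0 * num_bins) % num_bins
            hist[bin_idx] += weight * magnitude[px, py]
    # Every local peak within 80% of the global maximum yields an orientation;
    # a parabola fit to the 3 surrounding bins refines each peak position.
    peaks = []
    for b in range(num_bins):
        left, right = hist[(b - 1) % num_bins], hist[(b + 1) % num_bins]
        if hist[b] > left and hist[b] > right and hist[b] >= 0.8 * hist.max():
            offset = 0.5 * (left - right) / (left - 2 * hist[b] + right)
            peaks.append(((b + 0.5 + offset) * 360.0 / num_bins) % 360.0)
    return peaks
```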

Figure 6 shows the experimental stability of location, scale, and orientation assignment under differing amounts of image noise. As before the images are rotated and scaled by random amounts. The top line shows the stability of keypoint location and scale assignment. The second line shows the stability of matching when the orientation assignment is also required to be within 15 degrees. As shown by the gap between the top two lines, the orientation assignment remains accurate 95% of the time even after addition of ±10% pixel noise (equivalent to a camera providing less than 3 bits of precision).

Figure 6. The top line in the graph shows the percent of keypoint locations and scales that are repeatably detected as a function of pixel noise. The second line shows the repeatability after also requiring agreement in orientation. The bottom line shows the final percent of descriptors correctly matched to a large database.

The measured variance of orientation for the correct matches is about 2.5 degrees, rising to 3.9 degrees for 10% noise. The bottom line in Fig. 6 shows the final accuracy of correctly matching a keypoint descriptor to a database of 40,000 keypoints (to be discussed below). As this graph shows, the SIFT features are resistant to even large amounts of pixel noise, and the major cause of error is the initial location and scale detection.

6. The Local Image Descriptor

The previous operations have assigned an image location, scale, and orientation to each keypoint. These parameters impose a repeatable local 2D coordinate system in which to describe the local image region, and therefore provide invariance to these parameters. The next step is to compute a descriptor for the local image region that is highly distinctive yet is as invariant as possible to remaining variations, such as change in illumination or 3D viewpoint.

One obvious approach would be to sample the local image intensities around the keypoint at the appropriate scale, and to match these using a normalized correlation measure. However, simple correlation of image patches is highly sensitive to changes that cause misregistration of samples, such as affine or 3D viewpoint change or non-rigid deformations. A better approach has been demonstrated by Edelman et al. (1997). Their proposed representation was based upon a model of biological vision, in particular of complex neurons in primary visual cortex. These complex neurons respond to a gradient at a particular orientation and spatial frequency, but the location of the gradient on the retina is allowed to shift over a small receptive field rather than being precisely localized. Edelman et al. hypothesized that the function of these complex neurons was to allow for matching and recognition of 3D objects from a range of viewpoints. They have performed detailed experiments using 3D computer models of object and animal shapes which show that matching gradients while allowing for shifts in their position results in much better classification under 3D rotation. For example, recognition accuracy for 3D objects rotated in depth by 20 degrees increased from 35% for correlation of gradients to 94% using the complex cell model. Our implementation described below was inspired by this idea, but allows for positional shift using a different computational mechanism.


Figure 7. A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4 × 4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region. This figure shows a 2 × 2 descriptor array computed from an 8 × 8 set of samples, whereas the experiments in this paper use 4 × 4 descriptors computed from a 16 × 16 sample array.

6.1. Descriptor Representation

Figure 7 illustrates the computation of the keypoint descriptor. First the image gradient magnitudes and orientations are sampled around the keypoint location, using the scale of the keypoint to select the level of Gaussian blur for the image. In order to achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation. For efficiency, the gradients are precomputed for all levels of the pyramid as described in Section 5. These are illustrated with small arrows at each sample location on the left side of Fig. 7.

A Gaussian weighting function with σ equal to one half the width of the descriptor window is used to assign a weight to the magnitude of each sample point. This is illustrated with a circular window on the left side of Fig. 7, although, of course, the weight falls off smoothly. The purpose of this Gaussian window is to avoid sudden changes in the descriptor with small changes in the position of the window, and to give less emphasis to gradients that are far from the center of the descriptor, as these are most affected by misregistration errors.

The keypoint descriptor is shown on the right side of Fig. 7. It allows for significant shift in gradient positions by creating orientation histograms over 4 × 4 sample regions. The figure shows eight directions for each orientation histogram, with the length of each arrow corresponding to the magnitude of that histogram entry. A gradient sample on the left can shift up to 4 sample positions while still contributing to the same histogram on the right, thereby achieving the objective of allowing for larger local positional shifts.

It is important to avoid all boundary effects in which the descriptor abruptly changes as a sample shifts smoothly from being within one histogram to another or from one orientation to another. Therefore, trilinear interpolation is used to distribute the value of each gradient sample into adjacent histogram bins. In other words, each entry into a bin is multiplied by a weight of 1 − d for each dimension, where d is the distance of the sample from the central value of the bin as measured in units of the histogram bin spacing.

The descriptor is formed from a vector containing the values of all the orientation histogram entries, corresponding to the lengths of the arrows on the right side of Fig. 7. The figure shows a 2 × 2 array of orientation histograms, whereas our experiments below show that the best results are achieved with a 4 × 4 array of histograms with 8 orientation bins in each. Therefore, the experiments in this paper use a 4 × 4 × 8 = 128 element feature vector for each keypoint.

Finally, the feature vector is modified to reduce the effects of illumination change. First, the vector is normalized to unit length. A change in image contrast in which each pixel value is multiplied by a constant will multiply gradients by the same constant, so this contrast change will be canceled by vector normalization. A brightness change in which a constant is added to each image pixel will not affect the gradient values, as they are computed from pixel differences. Therefore, the descriptor is invariant to affine changes in illumination.


However, non-linear illumination changes can also occur due to camera saturation or due to illumination changes that affect 3D surfaces with differing orientations by different amounts. These effects can cause a large change in relative magnitudes for some gradients, but are less likely to affect the gradient orientations. Therefore, we reduce the influence of large gradient magnitudes by thresholding the values in the unit feature vector to each be no larger than 0.2, and then renormalizing to unit length. This means that matching the magnitudes for large gradients is no longer as important, and that the distribution of orientations has greater emphasis. The value of 0.2 was determined experimentally using images containing differing illuminations for the same 3D objects.
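The normalization just described reduces to a few lines; this sketch (my code, using the paper's 0.2 threshold) shows the order of operations.

```python
import numpy as np

def normalize_descriptor(vec, clamp=0.2, eps=1e-7):
    """Normalize, clamp large entries, and renormalize a descriptor vector."""
    vec = np.asarray(vec, dtype=float)
    vec = vec / (np.linalg.norm(vec) + eps)   # cancels affine contrast change
    vec = np.minimum(vec, clamp)              # damp large gradient magnitudes
    return vec / (np.linalg.norm(vec) + eps)  # renormalize to unit length
```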

6.2. Descriptor Testing

There are two parameters that can be used to vary the complexity of the descriptor: the number of orientations, r, in the histograms, and the width, n, of the n × n array of orientation histograms. The size of the resulting descriptor vector is rn². As the complexity of the descriptor grows, it will be able to discriminate better in a large database, but it will also be more sensitive to shape distortions and occlusion.

Figure 8 shows experimental results in which the number of orientations and size of the descriptor were varied. The graph was generated for a viewpoint transformation in which a planar surface is tilted by 50 degrees away from the viewer and 4% image noise is added. This is near the limits of reliable matching, as it is in these more difficult cases that descriptor performance is most important.

Figure 8. This graph shows the percent of keypoints giving the correct match to a database of 40,000 keypoints as a function of width of the n × n keypoint descriptor and the number of orientations in each histogram. The graph is computed for images with affine viewpoint change of 50 degrees and addition of 4% noise.

The results show the percent of keypoints that find a correct match to the single closest neighbor among a database of 40,000 keypoints. The graph shows that a single orientation histogram (n = 1) is very poor at discriminating, but the results continue to improve up to a 4 × 4 array of histograms with 8 orientations. After that, adding more orientations or a larger descriptor can actually hurt matching by making the descriptor more sensitive to distortion. These results were broadly similar for other degrees of viewpoint change and noise, although in some simpler cases discrimination continued to improve (from already high levels) with 5 × 5 and higher descriptor sizes. Throughout this paper we use a 4 × 4 descriptor with 8 orientations, resulting in feature vectors with 128 dimensions. While the dimensionality of the descriptor may seem high, we have found that it consistently performs better than lower-dimensional descriptors on a range of matching tasks and that the computational cost of matching remains low when using the approximate nearest-neighbor methods described below.

6.3. Sensitivity to Affine Change

The sensitivity of the descriptor to affine change is examined in Fig. 9. The graph shows the reliability of keypoint location and scale selection, orientation assignment, and nearest-neighbor matching to a database as a function of rotation in depth of a plane away from the viewer.

Figure 9. This graph shows the stability of detection for keypoint location, orientation, and final matching to a database as a function of affine distortion. The degree of affine distortion is expressed in terms of the equivalent viewpoint rotation in depth for a planar surface.


It can be seen that each stage of computation has reduced repeatability with increasing affine distortion, but that the final matching accuracy remains above 50% out to a 50 degree change in viewpoint.

To achieve reliable matching over a wider viewpoint angle, one of the affine-invariant detectors could be used to select and resample image regions, as discussed in Section 2. As mentioned there, none of these approaches is truly affine-invariant, as they all start from initial feature locations determined in a non-affine-invariant manner. In what appears to be the most affine-invariant method, Mikolajczyk (2002) has proposed and run detailed experiments with the Harris-affine detector. He found that its keypoint repeatability is below that given here out to about a 50 degree viewpoint angle, but that it then retains close to 40% repeatability out to an angle of 70 degrees, which provides better performance for extreme affine changes. The disadvantages are a much higher computational cost, a reduction in the number of keypoints, and poorer stability for small affine changes due to errors in assigning a consistent affine frame under noise. In practice, the allowable range of rotation for 3D objects is considerably less than for planar surfaces, so affine invariance is usually not the limiting factor in the ability to match across viewpoint change. If a wide range of affine invariance is desired, such as for a surface that is known to be planar, then a simple solution is to adopt the approach of Pritchard and Heidrich (2003) in which additional SIFT features are generated from 4 affine-transformed versions of the training image corresponding to 60 degree viewpoint changes. This allows for the use of standard SIFT features with no additional cost when processing the image to be recognized, but results in an increase in the size of the feature database by a factor of 3.

6.4. Matching to Large Databases

An important remaining issue for measuring the distinctiveness of features is how the reliability of matching varies as a function of the number of features in the database being matched. Most of the examples in this paper are generated using a database of 32 images with about 40,000 keypoints. Figure 10 shows how the matching reliability varies as a function of database size. This figure was generated using a larger database of 112 images, with a viewpoint depth rotation of 30 degrees and 2% image noise in addition to the usual random image rotation and scale change.

Figure 10. The dashed line shows the percent of keypoints correctly matched to a database as a function of database size (using a logarithmic scale). The solid line shows the percent of keypoints assigned the correct location, scale, and orientation. Images had random scale and rotation changes, an affine transform of 30 degrees, and image noise of 2% added prior to matching.

The dashed line shows the portion of image features for which the nearest neighbor in the database was the correct match, as a function of database size shown on a logarithmic scale. The leftmost point is matching against features from only a single image while the rightmost point is selecting matches from a database of all features from the 112 images. It can be seen that matching reliability does decrease as a function of the number of distractors, yet all indications are that many correct matches will continue to be found out to very large database sizes.

The solid line is the percentage of keypoints that were identified at the correct matching location and orientation in the transformed image, so it is only these points that have any chance of having matching descriptors in the database. The reason this line is flat is that the test was run over the full database for each value, while only varying the portion of the database used for distractors. It is of interest that the gap between the two lines is small, indicating that matching failures are due more to issues with initial feature localization and orientation assignment than to problems with feature distinctiveness, even out to large database sizes.

7. Application to Object Recognition

The major topic of this paper is the derivation of distinctive invariant keypoints, as described above.


demonstrate their application, we will now give a briefdescription of their use for object recognition in thepresence of clutter and occlusion. More details on ap-plications of these features to recognition are availablein other papers (Lowe, 1999, 2001; Se et al., 2002).

Object recognition is performed by first matching each keypoint independently to the database of keypoints extracted from training images. Many of these initial matches will be incorrect due to ambiguous features or features that arise from background clutter. Therefore, clusters of at least 3 features are first identified that agree on an object and its pose, as these clusters have a much higher probability of being correct than individual feature matches. Then, each cluster is checked by performing a detailed geometric fit to the model, and the result is used to accept or reject the interpretation.

7.1. Keypoint Matching

The best candidate match for each keypoint is found by identifying its nearest neighbor in the database of keypoints from training images. The nearest neighbor is defined as the keypoint with minimum Euclidean distance for the invariant descriptor vector as was described in Section 6.

However, many features from an image will not have any correct match in the training database because they arise from background clutter or were not detected in the training images. Therefore, it would be useful to have a way to discard features that do not have any good match to the database. A global threshold on distance to the closest feature does not perform well, as some descriptors are much more discriminative than others. A more effective measure is obtained by comparing the distance of the closest neighbor to that of the second-closest neighbor. If there are multiple training images of the same object, then we define the second-closest neighbor as being the closest neighbor that is known to come from a different object than the first, such as by only using images known to contain different objects. This measure performs well because correct matches need to have the closest neighbor significantly closer than the closest incorrect match to achieve reliable matching. For false matches, there will likely be a number of other false matches within similar distances due to the high dimensionality of the feature space. We can think of the second-closest match as providing an estimate of the density of false matches within this portion of the feature space and at the same time identifying specific instances of feature ambiguity.

Figure 11. The probability that a match is correct can be determined by taking the ratio of distance from the closest neighbor to the distance of the second closest. Using a database of 40,000 keypoints, the solid line shows the PDF of this ratio for correct matches, while the dotted line is for matches that were incorrect.

Figure 11 shows the value of this measure for real image data. The probability density functions for correct and incorrect matches are shown in terms of the ratio of closest to second-closest neighbors of each keypoint. Matches for which the nearest neighbor was a correct match have a PDF that is centered at a much lower ratio than that for incorrect matches. For our object recognition implementation, we reject all matches in which the distance ratio is greater than 0.8, which eliminates 90% of the false matches while discarding less than 5% of the correct matches. This figure was generated by matching images following random scale and orientation change, a depth rotation of 30 degrees, and addition of 2% image noise, against a database of 40,000 keypoints.
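The ratio test itself amounts to a few lines of brute-force descriptor comparison. The following is a minimal sketch, assuming 128-dimensional descriptors stored as NumPy arrays and an array of per-descriptor object labels; the function and variable names are illustrative, not taken from the original implementation.

```python
import numpy as np

def ratio_test_matches(query_desc, db_desc, db_labels, ratio=0.8):
    """Accept a match only if the closest database neighbor is significantly
    closer than the closest neighbor from a *different* object.

    query_desc : (Q, 128) descriptors from the new image
    db_desc    : (N, 128) descriptors from the training database
    db_labels  : (N,) object identity of each database descriptor
    Returns a list of (query_index, database_index) pairs.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - q, axis=1)   # Euclidean distances
        best = int(np.argmin(dists))
        # Second-closest neighbor taken from a different object than the best.
        other = dists[db_labels != db_labels[best]]
        if other.size == 0:
            continue
        if dists[best] < ratio * other.min():         # distance-ratio test
            matches.append((qi, best))
    return matches
```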

7.2. Efficient Nearest Neighbor Indexing

No algorithms are known that can identify the exact nearest neighbors of points in high dimensional spaces that are any more efficient than exhaustive search. Our keypoint descriptor has a 128-dimensional feature vector, and the best algorithms, such as the k-d tree (Friedman et al., 1977), provide no speedup over exhaustive search for more than about 10 dimensional spaces. Therefore, we have used an approximate algorithm, called the Best-Bin-First (BBF) algorithm (Beis and Lowe, 1997). This is approximate in the sense that it returns the closest neighbor with high probability.


The BBF algorithm uses a modified search ordering for the k-d tree algorithm so that bins in feature space are searched in the order of their closest distance from the query location. This priority search order was first examined by Arya and Mount (1993), and they provide further study of its computational properties in Arya et al. (1998). This search order requires the use of a heap-based priority queue for efficient determination of the search order. An approximate answer can be returned with low cost by cutting off further search after a specific number of the nearest bins have been explored. In our implementation, we cut off search after checking the first 200 nearest-neighbor candidates. For a database of 100,000 keypoints, this provides a speedup over exact nearest neighbor search by about 2 orders of magnitude yet results in less than a 5% loss in the number of correct matches. One reason the BBF algorithm works particularly well for this problem is that we only consider matches in which the nearest neighbor is less than 0.8 times the distance to the second-nearest neighbor (as described in the previous section), and therefore there is no need to exactly solve the most difficult cases in which many neighbors are at very similar distances.
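The following sketch conveys the flavor of a best-bin-first search on a toy k-d tree: unexplored branches are kept in a heap ordered by a lower bound on their distance to the query, and the search is cut off after a fixed number of nodes have been examined. It is a simplified illustration under those assumptions, not the data structure of Beis and Lowe (1997); the class and function names are invented for the example.

```python
import heapq
import numpy as np

class KDNode:
    def __init__(self, point, index, dim, left, right):
        self.point, self.index, self.dim = point, index, dim
        self.left, self.right = left, right

def build_kdtree(points, indices=None, depth=0):
    """Build a simple k-d tree, cycling through dimensions as the split axis."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 0:
        return None
    dim = depth % points.shape[1]
    order = indices[np.argsort(points[indices, dim])]
    mid = len(order) // 2
    return KDNode(points[order[mid]], int(order[mid]), dim,
                  build_kdtree(points, order[:mid], depth + 1),
                  build_kdtree(points, order[mid + 1:], depth + 1))

def bbf_nearest(root, query, max_checks=200):
    """Approximate nearest neighbor: explore branches in order of a lower
    bound on their distance to the query, stopping after max_checks nodes."""
    best_dist, best_idx = np.inf, -1
    heap = [(0.0, 0, root)]          # (lower bound, tiebreak id, node)
    next_id, checks = 1, 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if node is None or bound >= best_dist:
            continue
        checks += 1
        d = float(np.linalg.norm(node.point - query))
        if d < best_dist:
            best_dist, best_idx = d, node.index
        # Descend toward the near child first; queue the far child with the
        # distance to the splitting plane as its lower bound.
        diff = query[node.dim] - node.point[node.dim]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        heapq.heappush(heap, (0.0, next_id, near)); next_id += 1
        heapq.heappush(heap, (abs(diff), next_id, far)); next_id += 1
    return best_idx, best_dist
```

In practice the occasional failure on an ambiguous query is harmless, since such matches would be discarded by the distance-ratio test in any case.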

7.3. Clustering with the Hough Transform

To maximize the performance of object recognition for small or highly occluded objects, we wish to identify objects with the fewest possible number of feature matches. We have found that reliable recognition is possible with as few as 3 features. A typical image contains 2,000 or more features which may come from many different objects as well as background clutter. While the distance ratio test described in Section 7.1 will allow us to discard many of the false matches arising from background clutter, this does not remove matches from other valid objects, and we often still need to identify correct subsets of matches containing less than 1% inliers among 99% outliers. Many well-known robust fitting methods, such as RANSAC or Least Median of Squares, perform poorly when the percent of inliers falls much below 50%. Fortunately, much better performance can be obtained by clustering features in pose space using the Hough transform (Hough, 1962; Ballard, 1981; Grimson, 1990).

The Hough transform identifies clusters of features with a consistent interpretation by using each feature to vote for all object poses that are consistent with the feature. When clusters of features are found to vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature. Each of our keypoints specifies 4 parameters: 2D location, scale, and orientation, and each matched keypoint in the database has a record of the keypoint's parameters relative to the training image in which it was found. Therefore, we can create a Hough transform entry predicting the model location, orientation, and scale from the match hypothesis. This prediction has large error bounds, as the similarity transform implied by these 4 parameters is only an approximation to the full 6 degree-of-freedom pose space for a 3D object and also does not account for any non-rigid deformations. Therefore, we use broad bin sizes of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum projected training image dimension (using the predicted scale) for location. To avoid the problem of boundary effects in bin assignment, each keypoint match votes for the 2 closest bins in each dimension, giving a total of 16 entries for each hypothesis and further broadening the pose range.

In most implementations of the Hough transform, a multi-dimensional array is used to represent the bins. However, many of the potential bins will remain empty, and it is difficult to compute the range of possible bin values due to their mutual dependence (for example, the dependency of location discretization on the selected scale). These problems can be avoided by using a pseudo-random hash function of the bin values to insert votes into a one-dimensional hash table, in which collisions are easily detected.
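A compact way to picture the voting scheme is sketched below. It assumes each match already carries the model pose (location, orientation, and scale) it predicts for the new image, and it keys a Python dictionary on the bin tuple in place of the pseudo-random hash function; the attribute and function names are invented for the example.

```python
import itertools
from collections import defaultdict
from math import floor, log2

def hough_clusters(matches, max_train_dim,
                   loc_frac=0.25, ori_bin_deg=30.0, scale_factor=2.0):
    """Vote each match into broad pose bins and return clusters of >= 3 votes.

    Each match is assumed to expose match.predicted_pose = (x, y, ori_deg, scale),
    the pose it predicts for the model in the new image (hypothetical attribute).
    """
    bins = defaultdict(list)
    for m in matches:
        x, y, ori, scale = m.predicted_pose
        loc_bin = loc_frac * max_train_dim * scale   # location bin width uses predicted scale
        # Real-valued bin coordinates in the 4 pose dimensions.
        coords = (x / loc_bin, y / loc_bin,
                  ori / ori_bin_deg, log2(scale) / log2(scale_factor))
        # Vote for the two closest bins in every dimension: 2**4 = 16 entries.
        # (Orientation wrap-around is ignored in this sketch.)
        choices = []
        for c in coords:
            b = floor(c)
            choices.append((b, b + 1) if c - b >= 0.5 else (b - 1, b))
        for key in itertools.product(*choices):
            bins[key].append(m)
    return [votes for votes in bins.values() if len(votes) >= 3]
```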

7.4. Solution for Affine Parameters

The Hough transform is used to identify all clusters with at least 3 entries in a bin. Each such cluster is then subject to a geometric verification procedure in which a least-squares solution is performed for the best affine projection parameters relating the training image to the new image.

An affine transformation correctly accounts for 3D rotation of a planar surface under orthographic projection, but the approximation can be poor for 3D rotation of non-planar objects. A more general solution would be to solve for the fundamental matrix (Luong and Faugeras, 1996; Hartley and Zisserman, 2000). However, a fundamental matrix solution requires at least 7 point matches as compared to only 3 for the affine solution and in practice requires even more matches for good stability. We would like to perform recognition with as few as 3 feature matches, so the affine solution provides a better starting point and we can account for errors in the affine approximation by allowing for large residual errors. If we imagine placing a sphere around an object, then rotation of the sphere by 30 degrees will move no point within the sphere by more than 0.25 times the projected diameter of the sphere. For the examples of typical 3D objects used in this paper, an affine solution works well given that we allow residual errors up to 0.25 times the maximum projected dimension of the object. A more general approach is given in Brown and Lowe (2002), in which the initial solution is based on a similarity transform, which then progresses to solution for the fundamental matrix in those cases in which a sufficient number of matches are found.

The affine transformation of a model point [x y]^T to an image point [u v]^T can be written as

$$
\begin{bmatrix} u \\ v \end{bmatrix} =
\begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} +
\begin{bmatrix} t_x \\ t_y \end{bmatrix}
$$

where the model translation is [t_x t_y]^T and the affine rotation, scale, and stretch are represented by the m_i parameters. We wish to solve for the transformation parameters, so the equation above can be rewritten to gather the unknowns into a column vector:

$$
\begin{bmatrix}
x & y & 0 & 0 & 1 & 0 \\
0 & 0 & x & y & 0 & 1 \\
 & & \cdots & & & \\
 & & \cdots & & &
\end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix}
=
\begin{bmatrix} u \\ v \\ \vdots \end{bmatrix}
$$

This equation shows a single match, but any number of further matches can be added, with each match contributing two more rows to the first and last matrix. At least 3 matches are needed to provide a solution.

We can write this linear system as

$$
A\mathbf{x} = \mathbf{b}
$$

The least-squares solution for the parameters x can be determined by solving the corresponding normal equations,

$$
\mathbf{x} = [A^T A]^{-1} A^T \mathbf{b},
$$

which minimizes the sum of the squares of the distances from the projected model locations to the corresponding image locations. This least-squares approach could readily be extended to solving for 3D pose and internal parameters of articulated and flexible objects (Lowe, 1991).
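As a concrete illustration, the normal-equation solution above can be transcribed directly into a few lines of NumPy. The function name and array layout are chosen for the example, and numpy.linalg.lstsq is used as a numerically preferable way to compute the same least-squares minimizer.

```python
import numpy as np

def solve_affine(model_pts, image_pts):
    """Least-squares fit of the affine parameters [m1, m2, m3, m4, tx, ty]
    from n >= 3 matches between model points [x, y] and image points [u, v]."""
    model_pts = np.asarray(model_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    n = len(model_pts)
    A = np.zeros((2 * n, 6))
    b = np.zeros(2 * n)
    for i, ((x, y), (u, v)) in enumerate(zip(model_pts, image_pts)):
        A[2 * i]     = [x, y, 0, 0, 1, 0]   # row predicting u
        A[2 * i + 1] = [0, 0, x, y, 0, 1]   # row predicting v
        b[2 * i], b[2 * i + 1] = u, v
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    m1, m2, m3, m4, tx, ty = params
    return np.array([[m1, m2], [m3, m4]]), np.array([tx, ty])
```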

Outliers can now be removed by checking for agreement between each image feature and the model. Given the more accurate least-squares solution, we now require each match to agree within half the error range that was used for the parameters in the Hough transform bins. If fewer than 3 points remain after discarding outliers, then the match is rejected. As outliers are discarded, the least-squares solution is re-solved with the remaining points, and the process iterated. In addition, a top-down matching phase is used to add any further matches that agree with the projected model position. These may have been missed from the Hough transform bin due to the similarity transform approximation or other errors.
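Continuing the previous sketch, the iterative re-fitting loop might look like the following (the top-down phase that re-adds missed matches is not shown). The threshold argument stands in for half the Hough-bin error range, and solve_affine is the hypothetical helper defined above.

```python
import numpy as np

def fit_with_outlier_rejection(model_pts, image_pts, max_residual):
    """Repeatedly solve for the affine fit and drop matches whose projection
    error exceeds max_residual, until the inlier set stops changing."""
    model_pts = np.asarray(model_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    keep = np.ones(len(model_pts), dtype=bool)
    while True:
        if keep.sum() < 3:
            return None                          # too few inliers: reject the cluster
        M, t = solve_affine(model_pts[keep], image_pts[keep])
        residuals = np.linalg.norm(model_pts @ M.T + t - image_pts, axis=1)
        new_keep = keep & (residuals <= max_residual)
        if new_keep.sum() == keep.sum():
            return M, t, keep                    # converged
        keep = new_keep
```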

The final decision to accept or reject a model hypothesis is based on a detailed probabilistic model given in a previous paper (Lowe, 2001). This method first computes the expected number of false matches to the model pose, given the projected size of the model, the number of features within the region, and the accuracy of the fit. A Bayesian analysis then gives the probability that the object is present based on the actual number of matching features found. We accept a model if the final probability for a correct interpretation is greater than 0.98. For objects that project to small regions of an image, 3 features may be sufficient for reliable recognition. For large objects covering most of a heavily textured image, the expected number of false matches is higher, and as many as 10 feature matches may be necessary.

8. Recognition Examples

Figure 12 shows an example of object recognition for a cluttered and occluded image containing 3D objects. The training images of a toy train and a frog are shown on the left. The middle image (of size 600 × 480 pixels) contains instances of these objects hidden behind others and with extensive background clutter so that detection of the objects may not be immediate even for human vision. The image on the right shows the final correct identification superimposed on a reduced contrast version of the image. The keypoints that were used for recognition are shown as squares with an extra line to indicate orientation. The sizes of the squares correspond to the image regions used to construct the descriptor. An outer parallelogram is also drawn around each instance of recognition, with its sides corresponding to the boundaries of the training images projected under the final affine transformation determined during recognition.

Figure 12. The training images for two objects are shown on the left. These can be recognized in a cluttered image with extensive occlusion, shown in the middle. The results of recognition are shown on the right. A parallelogram is drawn around each recognized object showing the boundaries of the original training image under the affine transformation solved for during recognition. Smaller squares indicate the keypoints that were used for recognition.

Figure 13. This example shows location recognition within a complex scene. The training images for locations are shown at the upper left and the 640 × 315 pixel test image taken from a different viewpoint is on the upper right. The recognized regions are shown on the lower image, with keypoints shown as squares and an outer parallelogram showing the boundaries of the training images under the affine transform used for recognition.

Another potential application of the approach is to place recognition, in which a mobile device or vehicle could identify its location by recognizing familiar locations. Figure 13 gives an example of this application, in which training images are taken of a number of locations. As shown on the upper left, these can even be of such seemingly non-distinctive items as a wooden wall or a tree with trash bins. The test image (of size 640 by 315 pixels) on the upper right was taken from a viewpoint rotated about 30 degrees around the scene from the original positions, yet the training image locations are easily recognized.

All steps of the recognition process can be implemented efficiently, so the total time to recognize all objects in Figs. 12 or 13 is less than 0.3 seconds on a 2 GHz Pentium 4 processor. We have implemented these algorithms on a laptop computer with attached video camera, and have tested them extensively over a wide range of conditions. In general, textured planar surfaces can be identified reliably over a rotation in depth of up to 50 degrees in any direction and under almost any illumination conditions that provide sufficient light and do not produce excessive glare. For 3D objects, the range of rotation in depth for reliable recognition is only about 30 degrees in any direction and illumination change is more disruptive. For these reasons, 3D object recognition is best performed by integrating features from multiple views, such as with local feature view clustering (Lowe, 2001).

These keypoints have also been applied to the problem of robot localization and mapping, which has been presented in detail in other papers (Se et al., 2001). In this application, a trinocular stereo system is used to determine 3D estimates for keypoint locations. Keypoints are used only when they appear in all 3 images with consistent disparities, resulting in very few outliers. As the robot moves, it localizes itself using feature matches to the existing 3D map, and then incrementally adds features to the map while updating their 3D positions using a Kalman filter. This provides a robust and accurate solution to the problem of robot localization in unknown environments. This work has also addressed the problem of place recognition, in which a robot can be switched on and recognize its location anywhere within a large map (Se et al., 2002), which is equivalent to a 3D implementation of object recognition.

9. Conclusions

The SIFT keypoints described in this paper are particularly useful due to their distinctiveness, which enables the correct match for a keypoint to be selected from a large database of other keypoints. This distinctiveness is achieved by assembling a high-dimensional vector representing the image gradients within a local region of the image. The keypoints have been shown to be invariant to image rotation and scale and robust across a substantial range of affine distortion, addition of noise, and change in illumination. Large numbers of keypoints can be extracted from typical images, which leads to robustness in extracting small objects among clutter. The fact that keypoints are detected over a complete range of scales means that small local features are available for matching small and highly occluded objects, while large keypoints perform well for images subject to noise and blur. Their computation is efficient, so that several thousand keypoints can be extracted from a typical image with near real-time performance on standard PC hardware.

This paper has also presented methods for using the keypoints for object recognition. The approach we have described uses approximate nearest-neighbor lookup, a Hough transform for identifying clusters that agree on object pose, least-squares pose determination, and final verification. Other potential applications include view matching for 3D reconstruction, motion tracking and segmentation, robot localization, image panorama assembly, epipolar calibration, and any others that require identification of matching locations between images.

There are many directions for further research in deriving invariant and distinctive image features. Systematic testing is needed on data sets with full 3D viewpoint and illumination changes. The features described in this paper use only a monochrome intensity image, so further distinctiveness could be derived from including illumination-invariant color descriptors (Funt and Finlayson, 1995; Brown and Lowe, 2002). Similarly, local texture measures appear to play an important role in human vision and could be incorporated into feature descriptors in a more general form than the single spatial frequency used by the current descriptors. An attractive aspect of the invariant local feature approach to matching is that there is no need to select just one feature type, and the best results are likely to be obtained by using many different features, all of which can contribute useful matches and improve overall robustness.

Another direction for future research will be to individually learn features that are suited to recognizing particular object categories. This will be particularly important for generic object classes that must cover a broad range of possible appearances. The research of Weber et al. (2000) and Fergus et al. (2003) has shown the potential of this approach by learning small sets of local features that are suited to recognizing generic classes of objects. In the long term, feature sets are likely to contain both prior and learned features that will be used according to the amount of training data that has been available for various object classes.

Acknowledgments

I would particularly like to thank Matthew Brown, who has suggested numerous improvements to both the content and presentation of this paper and whose own work on feature localization and invariance has contributed to this approach. In addition, I would like to thank many others for their valuable suggestions, including Stephen Se, Jim Little, Krystian Mikolajczyk, Cordelia Schmid, Tony Lindeberg, and Andrew Zisserman. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and through the Institute for Robotics and Intelligent Systems (IRIS) Network of Centres of Excellence.

References

Arya, S. and Mount, D.M. 1993. Approximate nearest neighbor queries in fixed dimensions. In Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'93), pp. 271–280.

Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., and Wu, A.Y. 1998. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891–923.

Ballard, D.H. 1981. Generalizing the Hough transform to detect arbitrary patterns. Pattern Recognition, 13(2):111–122.

Basri, R. and Jacobs, D.W. 1997. Recognition using region correspondences. International Journal of Computer Vision, 25(2):145–166.

Baumberg, A. 2000. Reliable feature matching across widely separated views. In Conference on Computer Vision and Pattern Recognition, Hilton Head, South Carolina, pp. 774–781.

Beis, J. and Lowe, D.G. 1997. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Conference on Computer Vision and Pattern Recognition, Puerto Rico, pp. 1000–1006.

Brown, M. and Lowe, D.G. 2002. Invariant features from interest point groups. In British Machine Vision Conference, Cardiff, Wales, pp. 656–665.

Carneiro, G. and Jepson, A.D. 2002. Phase-based local features. In European Conference on Computer Vision (ECCV), Copenhagen, Denmark, pp. 282–296.

Crowley, J.L. and Parker, A.C. 1984. A representation for shape based on peaks and ridges in the difference of low-pass transform. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6(2):156–170.

Edelman, S., Intrator, N., and Poggio, T. 1997. Complex cells and object recognition. Unpublished manuscript: http://kybele.psych.cornell.edu/~edelman/archive.html

Fergus, R., Perona, P., and Zisserman, A. 2003. Object class recognition by unsupervised scale-invariant learning. In IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, pp. 264–271.

Friedman, J.H., Bentley, J.L., and Finkel, R.A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226.

Funt, B.V. and Finlayson, G.D. 1995. Color constant color indexing. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5):522–529.

Grimson, E. 1990. Object Recognition by Computer: The Role of Geometric Constraints, The MIT Press: Cambridge, MA.

Harris, C. 1992. Geometry from visual motion. In Active Vision, A. Blake and A. Yuille (Eds.), MIT Press, pp. 263–284.

Harris, C. and Stephens, M. 1988. A combined corner and edge detector. In Fourth Alvey Vision Conference, Manchester, UK, pp. 147–151.

Hartley, R. and Zisserman, A. 2000. Multiple View Geometry in Computer Vision, Cambridge University Press: Cambridge, UK.

Hough, P.V.C. 1962. Method and means for recognizing complex patterns. U.S. Patent 3069654.

Koenderink, J.J. 1984. The structure of images. Biological Cybernetics, 50:363–396.

Lindeberg, T. 1993. Detecting salient blob-like image structures and their scales with a scale-space primal sketch: A method for focus-of-attention. International Journal of Computer Vision, 11(3):283–318.

Lindeberg, T. 1994. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, 21(2):224–270.

Lowe, D.G. 1991. Fitting parameterized three-dimensional models to images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(5):441–450.

Lowe, D.G. 1999. Object recognition from local scale-invariant features. In International Conference on Computer Vision, Corfu, Greece, pp. 1150–1157.

Lowe, D.G. 2001. Local feature view clustering for 3D object recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pp. 682–688.

Luong, Q.T. and Faugeras, O.D. 1996. The fundamental matrix: Theory, algorithms, and stability analysis. International Journal of Computer Vision, 17(1):43–76.

Matas, J., Chum, O., Urban, M., and Pajdla, T. 2002. Robust wide baseline stereo from maximally stable extremal regions. In British Machine Vision Conference, Cardiff, Wales, pp. 384–393.

Mikolajczyk, K. 2002. Detection of local features invariant to affine transformations, Ph.D. thesis, Institut National Polytechnique de Grenoble, France.

Mikolajczyk, K. and Schmid, C. 2002. An affine invariant interest point detector. In European Conference on Computer Vision (ECCV), Copenhagen, Denmark, pp. 128–142.

Mikolajczyk, K., Zisserman, A., and Schmid, C. 2003. Shape recognition with edge-based features. In Proceedings of the British Machine Vision Conference, Norwich, U.K.

Moravec, H. 1981. Rover visual obstacle avoidance. In International Joint Conference on Artificial Intelligence, Vancouver, Canada, pp. 785–790.

Nelson, R.C. and Selinger, A. 1998. Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research, 38(15):2469–2488.

Pope, A.R. and Lowe, D.G. 2000. Probabilistic models of appearance for 3-D object recognition. International Journal of Computer Vision, 40(2):149–167.

Pritchard, D. and Heidrich, W. 2003. Cloth motion capture. Computer Graphics Forum (Eurographics 2003), 22(3):263–271.

Schaffalitzky, F. and Zisserman, A. 2002. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?" In European Conference on Computer Vision, Copenhagen, Denmark, pp. 414–431.

Schiele, B. and Crowley, J.L. 2000. Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision, 36(1):31–50.

Schmid, C. and Mohr, R. 1997. Local grayvalue invariants for image retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(5):530–534.

Se, S., Lowe, D.G., and Little, J. 2001. Vision-based mobile robot localization and mapping using scale-invariant features. In International Conference on Robotics and Automation, Seoul, Korea, pp. 2051–2058.

Se, S., Lowe, D.G., and Little, J. 2002. Global localization using distinctive visual features. In International Conference on Intelligent Robots and Systems, IROS 2002, Lausanne, Switzerland, pp. 226–231.

Shokoufandeh, A., Marsic, I., and Dickinson, S.J. 1999. View-based object recognition using saliency maps. Image and Vision Computing, 17:445–460.

Torr, P. 1995. Motion segmentation and outlier detection, Ph.D. Thesis, Dept. of Engineering Science, University of Oxford, UK.

Tuytelaars, T. and Van Gool, L. 2000. Wide baseline stereo based on local, affinely invariant regions. In British Machine Vision Conference, Bristol, UK, pp. 412–422.

Weber, M., Welling, M., and Perona, P. 2000. Unsupervised learning of models for recognition. In European Conference on Computer Vision, Dublin, Ireland, pp. 18–32.

Witkin, A.P. 1983. Scale-space filtering. In International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, pp. 1019–1022.

Zhang, Z., Deriche, R., Faugeras, O., and Luong, Q.T. 1995. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 78:87–119.