Feature Detection and Descriptors

Charles Hatt, Nisha Kiran, Lulu Zhang

Overview

• Background
– Motivation
– Timeline and related work
• SIFT / SIFT Extensions
– PCA-SIFT
– GLOH
• DAISY
• Performance Evaluation

Scope

• We cover local descriptors
• Basic Procedure:
– Find patches or key points
– Compute a descriptor
– Match to other points
• Local vs Global:
– Robust to occlusion and clutter
– Stable under image transforms

Color Histogram: A Global Descriptor

Motivation

Object Recognition

Robot Self Localization

• DARPA Urban Challenge: cars can recognize four-way stops

Image Retrieval

Tracking

Things We Did in Class

• Image stitching
• Image alignment

Good Descriptors are Invariant to

Timeline

• Cross correlation
• Canny Edge Detector 1986
• Harris Corner Detector 1988
• Moment Invariants 1991
• SIFT 1999
• Shape Context 2002
• PCA-SIFT 2004
• Spin Images 2005
• GLOH 2005
• DAISY 2008

Cross Correlation

Moment Invariants

• a = degree
• p + q = order
• I_d(x, y) = image gradient in direction d, where d = horizontal or vertical
• Invariant to convolution, blurring, affine transforms; can compute any order or degree
• Higher order sensitive to small photometric distortions

Spin Images (Johnson 97)

Spin Images (Lazebnik 05)

• Normalizing the patch makes the descriptor invariant to intensity changes; the histogram is also invariant to rotation.

• Usually the histogram has 10 bins for intensity and 5 bins for distance from the center.

• Descriptor has 10 * 5 = 50 elements.
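The 10 x 5 histogram above can be sketched as follows. This is a minimal illustration, not Lazebnik's implementation: it assumes a square patch with intensities already normalized to [0, 1], uses hard binning instead of soft binning, and the function name `spin_image` is made up for this example.

```python
import math

def spin_image(patch, intensity_bins=10, distance_bins=5):
    """2-D histogram over (distance from center, intensity) for a square patch.

    Assumes patch intensities are already normalized to [0, 1].
    """
    size = len(patch)
    center = (size - 1) / 2.0
    max_dist = center  # pixels outside the inscribed circle are ignored
    hist = [[0.0] * intensity_bins for _ in range(distance_bins)]
    for y in range(size):
        for x in range(size):
            dx, dy = x - center, y - center
            dist = math.sqrt(dx * dx + dy * dy)
            if dist > max_dist:
                continue
            d_bin = min(int(dist / max_dist * distance_bins), distance_bins - 1)
            i_bin = min(int(patch[y][x] * intensity_bins), intensity_bins - 1)
            hist[d_bin][i_bin] += 1.0
    total = sum(sum(row) for row in hist)
    # Flatten to a 50-element vector that sums to one.
    return [v / total for row in hist for v in row]

# Toy 21x21 patch and a copy rotated by 90 degrees about its center.
patch = [[((x * y) % 7) / 6.0 for x in range(21)] for y in range(21)]
descriptor = spin_image(patch)
rotated = [list(row) for row in zip(*patch[::-1])]
```

Because each pixel contributes only through its distance from the center and its intensity, the rotated patch produces the same 50 values.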

Shape Context

Scale Invariant Features

Characteristics of good features
• Repeatability
– The same feature can be found in several images despite geometric and photometric transformations
• Saliency
– Each feature has a distinctive description
• Compactness and efficiency
– Many fewer features than image pixels
• Locality
– Features occupy a very small area of the image, robust to clutter and occlusion

Good features - Corners in image

• Harris corner detector
• Key idea: in the region around the corner, image gradient has two or more dominant directions
• Invariant to
– Rotation
– Partially to affine intensity change
• I → I + b: invariant, since only derivatives are used
• I → a*I: not invariant, since the derivatives (and hence the corner response) are scaled
• Not invariant to scale
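The invariance claims above can be checked on a toy image. The sketch below assumes the commonly used response R = det(M) - k * trace(M)^2 with k = 0.04, central-difference gradients, and an unweighted 3x3 window; the slides do not give these details, so treat them as illustrative choices.

```python
def harris_response(image, y, x, window=1, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 at pixel (y, x).

    M sums products of central-difference gradients over a
    (2*window+1)^2 neighborhood; k = 0.04 is an assumed conventional value.
    """
    sxx = syy = sxy = 0.0
    for v in range(y - window, y + window + 1):
        for u in range(x - window, x + window + 1):
            ix = (image[v][u + 1] - image[v][u - 1]) / 2.0
            iy = (image[v + 1][u] - image[v - 1][u]) / 2.0
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    return sxx * syy - sxy * sxy - k * (sxx + syy) ** 2

# Toy 9x9 image: a bright square occupying the lower-right quadrant.
image = [[1.0 if (r >= 4 and c >= 4) else 0.0 for c in range(9)] for r in range(9)]
corner = harris_response(image, 4, 4)  # at the corner of the square
edge = harris_response(image, 6, 4)    # on the vertical edge of the square
flat = harris_response(image, 1, 1)    # in the uniform region

shifted = [[p + 5.0 for p in row] for row in image]  # I + b
scaled = [[2.0 * p for p in row] for row in image]   # a * I
```

The response is positive at the corner, negative on the edge, and zero in the flat region. Adding a constant leaves it unchanged, while multiplying the image by a multiplies the response by a^4, so a fixed threshold no longer selects the same points.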

Not invariant to scale

Figure: the same structure at two scales. At the finer scale all points will be classified as edges; at the coarser scale it is a corner.

Scale Invariant Detection

• Consider regions (e.g. circles) of different sizes around a point

• Regions of corresponding sizes will look the same in both images

Scale invariant feature detection

• Goal: independently detect corresponding regions in scaled versions of the same image

• Need scale selection mechanism for finding characteristic region size that is covariant with the image transformation

Recall: Edge detection

• Convolution with derivative of Gaussian => Edge at maximum of derivative

• Convolution with second derivative of Gaussian => Edge at zero crossing

Figure: signal f, derivative of Gaussian dg/dx, and response f * dg/dx. Edge = maximum of derivative.

Figure: signal f and its response to the second derivative of Gaussian (Laplacian). Edge = zero crossing of second derivative.

Scale selection

• Define the characteristic scale as the scale that produces peak of Laplacian response
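A one-dimensional sketch of this scale selection, under assumptions not stated on the slide: the Laplacian is replaced by the second derivative of a Gaussian, the response is multiplied by sigma^2 (scale normalization) so that values at different scales are comparable, and the structure is a simple box signal. All function names are illustrative.

```python
import math

def gaussian_second_derivative(x, sigma):
    """Second derivative of a 1-D Gaussian with standard deviation sigma."""
    g = math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
    return g * (x * x - sigma * sigma) / sigma ** 4

def normalized_laplacian_response(signal, center, sigma):
    """Scale-normalized response (sigma^2 times the filter output) at one sample."""
    radius = int(4 * sigma) + 1
    total = 0.0
    for offset in range(-radius, radius + 1):
        idx = center + offset
        if 0 <= idx < len(signal):
            total += signal[idx] * gaussian_second_derivative(offset, sigma)
    return sigma * sigma * total

def characteristic_scale(signal, center, sigmas):
    """Scale at which the magnitude of the normalized response peaks."""
    return max(sigmas, key=lambda s: abs(normalized_laplacian_response(signal, center, s)))

# Box of half-width 8 samples centered at index 100.
half_width = 8
center = 100
signal = [0.0] * 200
for i in range(center - half_width, center + half_width + 1):
    signal[i] = 1.0

sigmas = [0.5 * s for s in range(2, 41)]  # 1.0, 1.5, ..., 20.0
best_sigma = characteristic_scale(signal, center, sigmas)
```

For a box, the selected scale tracks the half-width of the box, so doubling the size of the structure roughly doubles the characteristic scale. That is the covariance with image scaling that the detector needs.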

SIFT stages

• Scale space extrema detection
• Keypoint localization
• Orientation assignment
• Keypoint descriptor

Scale space extrema detection

• Approximate Laplacian of Gaussian with Difference of Gaussian
– Computationally less intensive
– Invariant to scale
• Images of the same size (a vertical stack in the figure) form an octave. Each octave contains a fixed number of progressively blurred images.

Maxima/Minima selection in DoG

SIFT: Find the local maxima of difference of Gaussian in space and scale
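A minimal sketch of the extremum test, assuming the blurred images of one octave are already available as nested lists (the Gaussian blurring itself is omitted). A sample is kept when it is larger or smaller than all 26 neighbors in the 3x3x3 block spanning the adjacent DoG levels; the helper names are illustrative.

```python
def difference_of_gaussians(blurred):
    """Subtract adjacent blur levels of one octave: D_i = L_(i+1) - L_i."""
    return [
        [[b - a for a, b in zip(row_lo, row_hi)] for row_lo, row_hi in zip(lo, hi)]
        for lo, hi in zip(blurred, blurred[1:])
    ]

def is_scale_space_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or strictly below all 26
    neighbors in space and scale."""
    value = dog[s][y][x]
    neighbors = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return value > max(neighbors) or value < min(neighbors)

def level(center_value):
    """Hypothetical 3x3 'blurred image' that is zero except at its center."""
    img = [[0.0] * 3 for _ in range(3)]
    img[1][1] = center_value
    return img

octave = [level(0.0), level(1.0), level(6.0), level(7.0)]
dog = difference_of_gaussians(octave)  # three DoG levels from four images
```

Here the center of the middle DoG level holds the value 5, larger than every neighbor in space and scale, so it is reported as an extremum.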

Keypoint localization

• Many keypoints are detected
• Sub-pixel localization: accurate location of keypoints
• Eliminating points with low contrast
• Eliminating edge responses

Sub pixel localization

Eliminating extra keypoints

• If the magnitude of the DoG value at the pixel being checked for maxima/minima is less than a certain threshold, the keypoint is rejected as low contrast.

• Removing edges – Idea similar to Harris corner detector
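Both rejection tests can be sketched in a few lines. The thresholds are assumptions taken from Lowe's SIFT paper rather than from these slides: a contrast threshold of 0.03 (for pixel values in [0, 1]) and a maximum principal-curvature ratio r = 10, tested through trace^2 / det < (r + 1)^2 / r on the 2x2 Hessian of the DoG image. As in the Harris detector, the test avoids computing eigenvalues explicitly.

```python
def keep_keypoint(dog, y, x, contrast_threshold=0.03, edge_ratio=10.0):
    """Apply the low-contrast test and the edge-response test at dog[y][x]."""
    if abs(dog[y][x]) < contrast_threshold:
        return False  # low contrast
    # 2x2 Hessian of the DoG image from finite differences.
    dxx = dog[y][x + 1] - 2.0 * dog[y][x] + dog[y][x - 1]
    dyy = dog[y + 1][x] - 2.0 * dog[y][x] + dog[y - 1][x]
    dxy = (dog[y + 1][x + 1] - dog[y + 1][x - 1]
           - dog[y - 1][x + 1] + dog[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0.0:
        return False  # curvatures of opposite sign, or one flat direction
    # The ratio of principal curvatures must stay below edge_ratio.
    return trace * trace / det < (edge_ratio + 1.0) ** 2 / edge_ratio

# Toy 3x3 DoG neighborhoods: an isotropic blob, a ridge (edge-like), a faint blob.
blob = [[1.0 - 0.1 * ((r - 1) ** 2 + (c - 1) ** 2) for c in range(3)] for r in range(3)]
ridge = [[1.0 - 0.1 * (c - 1) ** 2 for c in range(3)] for r in range(3)]
faint = [[0.01 * v for v in row] for row in blob]
```

The blob passes both tests, the ridge fails the edge test because it curves in only one direction, and the faint blob fails the contrast test.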

• Until now, we have seen scale invariance
• Now, let’s make the keypoint rotation invariant

Orientation assignment

• Key idea: Collect gradient directions and magnitudes around each keypoint. Then figure out the most prominent orientations in that region. Assign these orientations to the keypoint

• The size of the orientation collection region depends on the scale: the larger the scale, the larger the collection region.

• Compute gradient magnitude and orientations for each pixel and then construct a histogram

• Peak of the histogram taken as the keypoint orientation
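A simplified sketch of the orientation histogram described above. The 36-bin resolution is an assumption (it is the value used in Lowe's paper, not stated on the slide), and the Gaussian weighting of samples and the handling of secondary peaks are left out.

```python
import math

def dominant_orientation(patch, num_bins=36):
    """Center (in degrees) of the peak bin of a magnitude-weighted
    gradient-orientation histogram over the patch interior."""
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            hist[int(angle / bin_width) % num_bins] += magnitude
    peak = max(range(num_bins), key=lambda b: hist[b])
    return (peak + 0.5) * bin_width

# Toy patches: intensity ramps along +x and along the diagonal.
ramp_x = [[float(x) for x in range(9)] for _ in range(9)]
ramp_diag = [[float(x + y) for x in range(9)] for y in range(9)]
```

The horizontal ramp lands in the first bin (0 to 10 degrees) and the diagonal ramp in the bin covering 40 to 50 degrees, so rotating the patch shifts the assigned orientation accordingly.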

Keypoint descriptor

• Based on 16*16 patches
• 4*4 subregions
• 8 bins in each subregion
• 4*4*8 = 128 dimensions in total
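The 4*4*8 layout can be made concrete with a stripped-down descriptor. This sketch assumes the 16x16 gradient arrays are already rotated to the keypoint orientation, and it omits the Gaussian weighting, interpolation between bins, and value clipping that the full SIFT descriptor uses.

```python
import math

def sift_like_descriptor(gx, gy):
    """128-D vector from 16x16 gradient arrays: 4x4 cells x 8 orientation bins."""
    hist = [0.0] * (4 * 4 * 8)
    for y in range(16):
        for x in range(16):
            magnitude = math.hypot(gx[y][x], gy[y][x])
            angle = math.atan2(gy[y][x], gx[y][x]) % (2.0 * math.pi)
            o_bin = int(angle / (2.0 * math.pi) * 8) % 8
            cell = (y // 4) * 4 + (x // 4)  # which 4x4-pixel subregion
            hist[cell * 8 + o_bin] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

# Toy patch: uniform gradient pointing along +x.
gx = [[1.0] * 16 for _ in range(16)]
gy = [[0.0] * 16 for _ in range(16)]
descriptor = sift_like_descriptor(gx, gy)
```

With a uniform gradient, each of the 16 subregions puts all its weight into a single orientation bin, so exactly 16 of the 128 entries are nonzero.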

PCA-SIFT

• PCA-SIFT is a modification of SIFT, which changes how the keypoint descriptors are constructed

• Basic Idea: Use PCA (Principal Component Analysis) to represent the gradient patch around the keypoint

• PCA stages
– Computing projection matrix
– Constructing PCA-SIFT descriptor

Computing projection matrix

• Select a representative set of pictures and detect all keypoints in these pictures

• For each keypoint
– Extract an image patch around it with size 41*41 pixels
– Calculate horizontal and vertical gradients, resulting in a vector of size 39*39*2 = 3042
• Put all these vectors into a k*3042 matrix A, where k is the number of keypoints detected
• Calculate the covariance matrix of A

Contd..

• Compute the eigenvectors and the eigenvalues of cov A.

• Select the first n eigenvectors; the projection matrix is an n*3042 matrix composed of these eigenvectors

• The projection matrix is only computed once and saved.

Dimension reduction through PCA

The image patches do not span the entire space of pixel values, nor even the smaller space of patches from natural images. They consist of a highly restricted set of patches that passed the first three stages of SIFT.

Constructing PCA-SIFT descriptor

• Input: location of keypoint, scale, orientation.
• Extract a 41*41 patch around the keypoint at the given scale, rotated to its orientation
• Calculate 39*39 horizontal and vertical gradients, resulting in a vector of size 3042
• Multiply this vector by the precomputed n*3042 projection matrix
• This results in a PCA-SIFT descriptor of size n
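The two stages can be sketched on toy data. To stay dependency-free, this example works in 3 dimensions instead of 3042 and uses power iteration with deflation as a stand-in for a library eigen-solver; neither choice comes from the PCA-SIFT paper, and all names are illustrative.

```python
import math

def covariance(rows):
    """Covariance matrix of k row vectors, each of length d."""
    k, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / k for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (k - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvectors(cov, n, iterations=500):
    """Leading n eigenvectors via power iteration with deflation
    (an assumed substitute for a proper eigen-solver)."""
    d = len(cov)
    work = [row[:] for row in cov]
    vectors = []
    for _ in range(n):
        v = [float(i + 1) for i in range(d)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(work[i][j] * v[j] for j in range(d))
                         for i in range(d))
        vectors.append(v)
        for i in range(d):  # remove the component just found
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return vectors  # n x d projection matrix

def project(projection, vector):
    """Multiply a d-vector by the n x d projection matrix."""
    return [sum(p * x for p, x in zip(row, vector)) for row in projection]

# Toy "gradient vectors": most variance lies along the direction (1, 2, 0).
rows = [[1, 2, 0.1], [2, 4, -0.1], [3, 6, 0.1], [4, 8, -0.1], [5, 10, 0.1], [6, 12, -0.1]]
projection = top_eigenvectors(covariance(rows), n=2)
descriptor = project(projection, rows[0])
gradient_vector_length = 39 * 39 * 2  # 3042, as on the slide
```

The first row of the projection matrix comes out close to (1, 2, 0) / sqrt(5), up to sign, and each input vector is reduced from d values to n.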

Eigenspace construction

Effect of PCA dimension

Hypothesis: The first several components of the PCA subspace are sufficient for encoding variations caused by keypoint identity, while the later components represent details that are not useful, or potentially detrimental, such as distortion from projective warp

Gradient Location Orientation Histogram

• Another SIFT extension
• Gradients quantized into 16 bins
• Log-polar location grid
– 3 bins for radius: 6, 11, 15
– 8 bins for direction: 0, π/4, π/2, … 7π/4

GLOH

17 location bins, 16 gradient bins per location bin

17 * 16 = 272 elements -> down to 128 with PCA
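The count of 17 location bins follows if the innermost region (radius below 6) is a single bin and only the two outer rings are split into 8 directions: 1 + 2 * 8 = 17. The sketch below assumes that layout; the function name and the bin numbering are illustrative.

```python
import math

RADII = (6.0, 11.0, 15.0)  # ring boundaries from the slide
ANGULAR_BINS = 8
ORIENTATION_BINS = 16

def gloh_location_bin(dx, dy):
    """Index (0..16) of the log-polar location bin for an offset from the
    keypoint, or None if the offset lies outside the outer radius."""
    r = math.hypot(dx, dy)
    if r < RADII[0]:
        return 0  # central bin, assumed not to be split by direction
    if r > RADII[2]:
        return None
    ring = 0 if r < RADII[1] else 1
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    sector = int(angle / (2.0 * math.pi) * ANGULAR_BINS) % ANGULAR_BINS
    return 1 + ring * ANGULAR_BINS + sector

location_bins = 1 + 2 * ANGULAR_BINS              # 17
raw_dimension = location_bins * ORIENTATION_BINS  # 272, before PCA
```

Each pixel of the patch then votes into one of the 16 gradient-orientation bins of its location bin, which gives the 272 raw elements.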

GLOH Results

192 correct, 208 false positive… Not as bad as it sounds

DAISY

• An efficient dense local descriptor
• Similar to SIFT and GLOH
– Descriptor is fundamentally based on pixel gradient histograms
– Has key differences that make it much faster for dense matching.
• Original application: Wide-baseline stereo
• Other applications: Face Recognition

SIFT GLOH*

+ Good Performance
+ Better Localization
- Not suitable for dense computation

+ Good Performance
- Not suitable for dense computation

* K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. PAMI’04.

DAISY

+ Suitable for dense computation
+ Improved performance:*
+ Precise localization
+ Rotational Robustness

DAISY

• Parameters:

Computation Steps

• Compute H (number of histogram bins) orientation maps, G_i (1 ≤ i ≤ H), one for each quantized gradient orientation

• G_o(u,v) = image gradient in direction o at pixel (u,v), clipped at zero: if the directional derivative is negative, G_o(u,v) = 0
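A sketch of the orientation maps, assuming central-difference gradients and H evenly spaced directions 2πi/H. Each map keeps only the positive part of the directional derivative, so opposite directions end up in different maps.

```python
import math

def orientation_maps(image, num_orientations=8):
    """Return H maps G_i with G_i[v][u] = max(directional derivative, 0)
    for directions 2*pi*i/H; border pixels are left at zero."""
    height, width = len(image), len(image[0])
    maps = [[[0.0] * width for _ in range(height)] for _ in range(num_orientations)]
    for v in range(1, height - 1):
        for u in range(1, width - 1):
            gx = (image[v][u + 1] - image[v][u - 1]) / 2.0
            gy = (image[v + 1][u] - image[v - 1][u]) / 2.0
            for i in range(num_orientations):
                theta = 2.0 * math.pi * i / num_orientations
                derivative = gx * math.cos(theta) + gy * math.sin(theta)
                maps[i][v][u] = max(derivative, 0.0)
    return maps

# Toy image whose intensity rises along +u.
image = [[float(u) for u in range(5)] for _ in range(5)]
maps = orientation_maps(image)
```

For this ramp, the map for direction 0 holds the full gradient, the map for the opposite direction is zero, and the intermediate directions hold the projected component.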

Computation Steps

• Each orientation map is then repeatedly convolved with a Gaussian kernel to obtain “convolved” orientation maps.

Computation Steps

• There are a total of Q (the number of ‘rings’ in the DAISY) levels of convolution.

Computation Steps

• Each pixel now has Q vectors, each H long, of the form:

• Each of these vectors is normalized to unit norm, which helps preserve viewpoint invariance.

• Full Descriptor
• Total Size
– Q*T*H + H
– In this case, 200
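The size formula can be checked against the sample layout: one H-bin histogram at the center plus T histograms on each of Q rings. The outer radius of 15 pixels and the even spacing of rings in this sketch are illustrative assumptions, and the function names are made up for the example.

```python
import math

def daisy_size(rings, segments, bins):
    """Q*T*H + H: one H-bin histogram per ring sample plus one at the center."""
    return rings * segments * bins + bins

def daisy_sample_offsets(radius, rings, segments):
    """(du, dv) offsets of the S = Q*T + 1 sample points: the center plus
    T points on each of Q concentric rings, spaced evenly out to `radius`."""
    offsets = [(0.0, 0.0)]
    for q in range(1, rings + 1):
        r = radius * q / rings
        for t in range(segments):
            angle = 2.0 * math.pi * t / segments
            offsets.append((r * math.cos(angle), r * math.sin(angle)))
    return offsets

size = daisy_size(rings=3, segments=8, bins=8)  # Q=3, T=8, H=8
offsets = daisy_sample_offsets(radius=15.0, rings=3, segments=8)
```

With Q = 3, T = 8, and H = 8 this gives 25 sample points and 3*8*8 + 8 = 200 descriptor values.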

Computational Complexity

DAISY (configuration: H=8, T=8, Q=3, S=25)
• 122 multiplications/pixel
• 119 summations/pixel
• 25 samplings/pixel

SIFT (configuration: 16 4x4 arrays, 8 bins)
• 1280 multiplications/pixel
• 512 summations/pixel
• 256 samplings/pixel

DAISY vs SIFT

• Computation Time:
• Reasons:
– DAISY descriptors share histograms.
– Computation pipeline enables an efficient memory access pattern, and histogram layers are separated early
• Easily parallelized

Performance with parallel cores

• Computation time falls almost linearly.

Choosing the Best DAISY

• Winder et al.
• Tested a wide variety of gradient and steerable filter based configurations for calculating image gradients at each pixel and found the best parameters for each.

• Found best configurations for different applications.


• Real-time applications:
– DAISY configuration: 1 or 2 rings, 4 bins
– Rectification of image gradients to length one, no use of PCA, and quantization of histogram values to a bit depth of 2-3.


• Applications requiring good discrimination:
– 2nd order steerable filters at two spatial scales
– Application of PCA
• Large-database applications (low storage requirements and computational burden)
– Steerable filters with H=4 histogram bins, Q=2 rings, T=8 segments
– Rectified gradients with 4 histogram bins, Q=1 ring and T=8 segments

Reported Applications of DAISY

• Wide-baseline stereo
• Face recognition

Depth Map Estimation

• DAISY descriptors used to measure similarities across images

• Graph cut based reconstruction algorithm used to generate maps.

• Occlusion masks are used to properly deal with occlusions

Occlusion maps

Depth Map Accuracy

• Ground truth – Laser scan

Depth Map Results

Face Recognition

• Dense descriptor computation is necessary for recognizing faces due to the wide-baseline nature of facial images.
– DAISY descriptors are calculated and matched using recursive grid search
– Match distances are vectorized and input to a Support Vector Machine (SVM)

Recognition Rate compared to previous, similar methods

Olivetti Research Lab Database

FERET Database

FERET Fafb – Varying facial expressions

FERET Fafc – Varying illumination

Local Descriptor Matching

• Methods for matching descriptor vectors
– Exhaustive Search
– Recursive Grid Search
– KD trees

Recursive Grid Search

• Find the local descriptor for each section of the template image in a grid (DT).
• Find the local descriptor for the corresponding section in the query image (DQ).
• Compute the distance between DT and DQ, as well as between DT and the descriptors of DQ’s neighbors at a distance d.
• The point showing the minimum distance (DT2) is considered for further analysis.
• The descriptors of the neighbors of DT2, at a distance d/2, are calculated…
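The coarse-to-fine idea in these steps can be sketched as a generic search. This is one reading of the slide, not the authors' implementation: `distance` is a hypothetical callable that returns the descriptor distance between the template point and a candidate location in the query image, and the step is halved until it drops below one pixel.

```python
def recursive_grid_search(distance, start, step):
    """Coarse-to-fine search: evaluate `distance` at the current point and its
    eight neighbors `step` pixels away, move to the best one, halve the step,
    and repeat until the step drops below one pixel."""
    best = start
    while step >= 1:
        candidates = [(best[0] + dx * step, best[1] + dy * step)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        best = min(candidates, key=distance)
        step //= 2
    return best

# Hypothetical descriptor distance whose minimum is at (13, 7).
def toy_distance(point):
    return (point[0] - 13) ** 2 + (point[1] - 7) ** 2

match = recursive_grid_search(toy_distance, start=(8, 8), step=8)
```

Starting at (8, 8) with d = 8, the search evaluates 9 candidates at each of 4 levels instead of every location in the window, and ends at the true minimum for this smooth toy distance. Real descriptor distances are not guaranteed to be this well behaved.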

Recursive Grid Search

KD Trees

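• Search for the nearest neighbor of a k-dimensional point among N stored points.

• A balanced tree is guaranteed to have a depth of about log2(N)
• Nearest-neighbor search has been shown to run in O(log N) average time.
• Pre-processing (building the tree) takes O(N log N) time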
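A compact KD-tree sketch in plain Python. It splits on the median along one axis per level and prunes subtrees that cannot contain a closer point. Sorting at every level keeps the code short but makes construction slower than the O(N log N) bound, which needs linear-time median selection. The dictionary-based node layout is an illustrative choice.

```python
import random

def build_kdtree(points, depth=0):
    """Recursively split on the median along axis depth % k."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Nearest neighbor of `query`; skips subtrees that cannot beat `best`."""
    if node is None:
        return best
    if best is None or squared_distance(node["point"], query) < squared_distance(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only cross the splitting plane if it is closer than the best match so far.
    if diff * diff < squared_distance(best, query):
        best = nearest(far, query, best)
    return best

random.seed(0)
points = [(random.random(), random.random(), random.random()) for _ in range(200)]
tree = build_kdtree(points)
query = (0.5, 0.5, 0.5)
found = nearest(tree, query)
```

The result agrees with an exhaustive search over all 200 points while visiting only part of the tree.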

KD Trees

Performance Evaluation

• Mikolajczyk and Schmid 2005
• Detecting
• Normalizing
• Describing
• Matching
• Graphing

Detecting

• 10 descriptors will be tested, but first: what will they be tested on?

• Harris points
• Harris Laplace
• Hessian Laplace
• Harris Affine
• Hessian Affine

Normalizing

• With respect to size: 41 pixels
• Orientation: Dominant gradient
• Illumination: normalize standard deviation and mean of pixel intensities

Matching

• For histogram based methods, Euclidean distance

• For non-histogram based methods, Mahalanobis distance (S = covariance matrix)

• After distance is calculated, two regions match if D < threshold

• Nearest neighbor threshold
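The distance measures and the two matching rules above can be sketched as follows. The Mahalanobis distance is taken as sqrt((a - b)^T S^-1 (a - b)), with the inverse covariance matrix supplied by the caller; function names and the toy descriptors are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mahalanobis(a, b, inv_cov):
    """sqrt((a-b)^T S^-1 (a-b)); inv_cov is the inverse covariance matrix S^-1."""
    d = [x - y for x, y in zip(a, b)]
    return math.sqrt(sum(d[i] * inv_cov[i][j] * d[j]
                         for i in range(len(d)) for j in range(len(d))))

def threshold_matches(query, candidates, threshold, dist=euclidean):
    """Indices of all candidates with D < threshold (several may match)."""
    return [i for i, c in enumerate(candidates) if dist(query, c) < threshold]

def nearest_neighbor_match(query, candidates, threshold, dist=euclidean):
    """Index of the closest candidate if its distance is below threshold, else None."""
    best = min(range(len(candidates)), key=lambda i: dist(query, candidates[i]))
    return best if dist(query, candidates[best]) < threshold else None

# Toy 2-D "descriptors".
candidates = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
query = (0.9, 0.1)
all_matches = threshold_matches(query, candidates, threshold=1.0)
nn_match = nearest_neighbor_match(query, candidates, threshold=1.0)
```

Plain thresholding accepts every candidate within the threshold (two here), while the nearest-neighbor rule keeps at most one match per region.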

Data Set

• Original image is subjected to …
• Rotations: 30-45 deg around optical axis
• Scale: camera zoom 2 – 2.5x
• Blur: defocusing
• Viewpoint: frontal to foreshortened
• Light: aperture varied
• Compression: JPEG at 5% quality

Evaluation Criteria

• For each patch, compute distance; is d < t?
• Compare to ground truth
• Count number of correct and false matches
• Recall vs 1-Precision graphs
• To build curves, change t and repeat. Now you can use recall and 1-precision to build graphs.
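A sketch of how one curve is built, assuming the definitions used by Mikolajczyk and Schmid: recall = correct matches / ground-truth correspondences, and 1 - precision = false matches / (correct matches + false matches). The distances and labels below are made-up toy data.

```python
def recall_precision_curve(distances, is_correct, num_correspondences, thresholds):
    """For each threshold t, accept pairs with d < t and return
    (1 - precision, recall) points."""
    curve = []
    for t in thresholds:
        accepted = [ok for d, ok in zip(distances, is_correct) if d < t]
        correct = sum(accepted)
        false = len(accepted) - correct
        recall = correct / num_correspondences
        one_minus_precision = false / len(accepted) if accepted else 0.0
        curve.append((one_minus_precision, recall))
    return curve

# Toy data: candidate-pair distances and their ground-truth labels (hypothetical).
distances = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
is_correct = [True, True, False, True, False, False]
curve = recall_precision_curve(distances, is_correct, num_correspondences=4,
                               thresholds=[0.25, 0.45, 0.65])
```

Raising t accepts more pairs, so the curve moves to the right as false matches accumulate and upward as correct matches are found. For the counts on the GLOH results slide (192 correct, 208 false), 1 - precision = 208 / 400 = 0.52.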

Notes

• If recall = 1 for any precision, we have a perfect descriptor

• Slowly increasing curve => descriptor is affected by the type of noise or transformation we applied to it

• Generally, if the curve for one type of descriptor is higher than the other, it is more robust to that type of transformation

Hessian-Affine detector on Structured Scene

Hessian Laplace Regions

Conclusion

• Feature detection and descriptors through the ages
