Feature Detection and Descriptors
Charles Hatt, Nisha Kiran, Lulu Zhang
Overview
• Background
  – Motivation
  – Timeline and related work
• SIFT / SIFT Extensions
  – PCA-SIFT
  – GLOH
• DAISY
• Performance Evaluation
Scope
• We cover local descriptors
• Basic procedure:
  – Find patches or key points
  – Compute a descriptor
  – Match to other points
• Local vs global:
  – Robust to occlusion and clutter
  – Stable under image transforms
Color Histogram: A Global Descriptor
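As a minimal sketch of such a global descriptor, assuming an RGB image stored as a NumPy array (the bin count is an arbitrary choice for illustration):

```python
import numpy as np

def color_histogram(image, bins=8):
    """Global color descriptor: one intensity histogram per RGB channel,
    concatenated and normalized. Ignores all spatial layout, which is
    exactly why it is a *global* rather than a local descriptor."""
    hists = []
    for c in range(3):
        h, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        hists.append(h)
    h = np.concatenate(hists).astype(float)
    return h / h.sum()  # normalize so the descriptor is image-size invariant

# A toy 4x4 "image": the descriptor length is 3 * bins = 24
img = np.random.default_rng(0).integers(0, 256, size=(4, 4, 3))
desc = color_histogram(img)
```

Because spatial layout is discarded, two very different images with the same color distribution produce identical descriptors, motivating the local descriptors that follow.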
Motivation
Object Recognition
Robot Self Localization
• DARPA Urban Challenge: cars can recognize four-way stops
Image Retrieval
Tracking
Things We Did in Class
• Image stitching
• Image alignment
Good Descriptors are Invariant to
Timeline
• Cross correlation
• Canny Edge Detector 1986
• Harris Corner Detector 1988
• Moment Invariants 1991
• SIFT 1999
• Shape Context 2002
• PCA-SIFT 2004
• Spin Images 2005
• GLOH 2005
• DAISY 2008
Cross Correlation
Moment Invariants
• a = degree, p + q = order, I_d(x, y) = image gradient in direction d, where d = horizontal or vertical
• Invariant to convolution, blurring, and affine transforms; can be computed for any order or degree
• Higher orders are sensitive to small photometric distortions
Spin Images (Johnson 97)
Spin Images (Lazebnik 05)
• Normalizing the patch makes the descriptor invariant to intensity changes; it is also invariant to rotation.
• The histogram typically uses 10 bins for intensity and 5 bins for distance from the center.
• The descriptor has 50 elements.
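A minimal sketch of the intensity-domain spin image, assuming a grayscale patch as a NumPy array (the normalization details are a simplification of Lazebnik's formulation):

```python
import numpy as np

def spin_image(patch, d_bins=5, i_bins=10):
    """Intensity-domain spin image: a 2-D histogram over (distance from the
    patch center, normalized intensity). Rotation invariance comes from
    discarding the angle around the center entirely."""
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(ys - cy, xs - cx)
    dist = dist / dist.max()                      # radial coordinate in [0, 1]
    inten = patch.astype(float)
    inten = (inten - inten.min()) / (np.ptp(inten) + 1e-9)  # intensity in [0, 1]
    hist, _, _ = np.histogram2d(dist.ravel(), inten.ravel(),
                                bins=(d_bins, i_bins), range=((0, 1), (0, 1)))
    hist = hist.ravel()                           # 5 * 10 = 50 elements
    return hist / hist.sum()

patch = np.random.default_rng(1).random((21, 21))
desc = spin_image(patch)
```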
Shape Context
Scale Invariant Features
Characteristics of good features
• Repeatability
  – The same feature can be found in several images despite geometric and photometric transformations
• Saliency
  – Each feature has a distinctive description
• Compactness and efficiency
  – Many fewer features than image pixels
• Locality
  – Features occupy a very small area of the image, making them robust to clutter and occlusion
Good features: corners in an image
• Harris corner detector
• Key idea: in the region around a corner, the image gradient has two or more dominant directions
• Invariant to:
  – Rotation
  – Affine intensity change, partially:
    • I → I + b: invariant, since only derivatives are used
    • I → a·I: not invariant
  – Not invariant to scale
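A minimal sketch of the Harris response, assuming a grayscale NumPy image; the 3x3 box window and k = 0.05 are illustrative simplifications (Harris used a Gaussian window):

```python
import numpy as np

def harris_response(img, k=0.05):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is the
    second-moment matrix of image gradients summed over a 3x3 window.
    R is large and positive at corners, negative on edges, ~0 on flat areas."""
    Iy, Ix = np.gradient(img.astype(float))
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def box(a):  # sum each pixel's 3x3 neighborhood (zero padding at borders)
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace ** 2

# A white square on black: corners respond positively, edge midpoints negatively.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
```

Note R at the square's corner (5, 5) is positive while R at the edge midpoint (5, 10) is negative, matching the "two dominant gradient directions" intuition above.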
Not invariant to scale: examined through a small window, every point along the curve is classified as an edge; only at a larger scale does the window capture the corner.
Scale Invariant Detection
• Consider regions (e.g. circles) of different sizes around a point
• Regions of corresponding sizes will look the same in both images
Scale invariant feature detection
• Goal: independently detect corresponding regions in scaled versions of the same image
• Need scale selection mechanism for finding characteristic region size that is covariant with the image transformation
Recall: Edge detection
• Convolution with the derivative of a Gaussian => edge at the maximum of the derivative
• Convolution with the second derivative of a Gaussian => edge at the zero crossing
[Figure: a signal f convolved with the derivative of a Gaussian (edge = maximum of the derivative) and with the second derivative of a Gaussian, the Laplacian (edge = zero crossing of the second derivative).]
Scale selection
• Define the characteristic scale as the scale that produces peak of Laplacian response
SIFT stages
• Scale space extrema detection• Keypoint localization• Orientation assignment• Keypoint descriptor
Scale space extrema detection
• Approximate the Laplacian of Gaussian with a Difference of Gaussians (DoG)
  – Computationally less intensive
  – Invariant to scale
• Images of the same size form an octave; each octave contains a fixed number of increasingly blurred images.
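A minimal sketch of building one DoG octave, assuming a grayscale NumPy image; the base sigma, number of levels, and scale step k are illustrative choices:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur implemented with two 1-D convolutions."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, out)

def dog_octave(img, sigma0=1.6, levels=5, k=2 ** 0.25):
    """One octave: successively blurred images at geometrically increasing
    sigma, then differences of adjacent pairs (the DoG approximation to LoG)."""
    blurred = [gaussian_blur(img, sigma0 * k ** i) for i in range(levels)]
    return [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]

img = np.random.default_rng(2).random((32, 32))
dogs = dog_octave(img)   # levels - 1 = 4 DoG images
```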
Maxima/Minima selection in DoG
SIFT: Find the local maxima of difference of Gaussian in space and scale
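A sketch of the extremum test, assuming a list of same-size DoG images: a pixel is kept when it beats all 26 neighbors in its 3x3x3 space-and-scale cube (the brute-force loops are for clarity, not speed):

```python
import numpy as np

def scale_space_extrema(dogs):
    """Return (scale, row, col) of pixels that are strictly greater than, or
    strictly less than, all 26 neighbors in the 3x3x3 scale-space cube."""
    D = np.stack(dogs)                       # shape: (scales, H, W)
    keypoints = []
    for s in range(1, D.shape[0] - 1):
        for y in range(1, D.shape[1] - 1):
            for x in range(1, D.shape[2] - 1):
                cube = D[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                v = D[s, y, x]
                neighbors = np.delete(cube.ravel(), 13)  # drop cube center
                if v > neighbors.max() or v < neighbors.min():
                    keypoints.append((s, y, x))
    return keypoints

# A single bright blob in the middle scale yields exactly one extremum.
dogs = [np.zeros((9, 9)) for _ in range(3)]
dogs[1][4, 4] = 1.0
kps = scale_space_extrema(dogs)
```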
Keypoint localization
• Many keypoints are detected
• Sub-pixel localization: accurate location of keypoints
• Eliminate points with low contrast
• Eliminate edge responses
Sub pixel localization
Eliminating extra keypoints
• If the magnitude of the DoG value at the candidate extremum is below a threshold, the keypoint is rejected as low contrast.
• Removing edge responses uses an idea similar to the Harris corner detector: reject points whose curvature is large in only one direction.
• Until now, we have seen scale invariance. Now, let's make the keypoint rotation invariant.
Orientation assignment
• Key idea: Collect gradient directions and magnitudes around each keypoint. Then figure out the most prominent orientations in that region. Assign these orientations to the keypoint
• The size of the orientation collection region depends on the scale: the bigger the scale, the bigger the collection region.
• Compute gradient magnitude and orientations for each pixel and then construct a histogram
• Peak of the histogram taken as the keypoint orientation
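The steps above can be sketched as follows, assuming a grayscale patch around the keypoint; 36 bins of 10 degrees is the common choice, and the full method's Gaussian weighting and peak interpolation are omitted:

```python
import numpy as np

def dominant_orientation(patch, bins=36):
    """Histogram of gradient orientations over the patch, weighted by gradient
    magnitude; the center of the peak bin is the keypoint orientation."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, edges = np.histogram(ang, bins=bins, range=(0, 360), weights=mag)
    peak = np.argmax(hist)
    return (edges[peak] + edges[peak + 1]) / 2.0   # center of the peak bin

# A horizontal intensity ramp: every gradient points along +x (~0 degrees).
ys, xs = np.mgrid[0:16, 0:16]
theta = dominant_orientation(xs.astype(float))
```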
Keypoint descriptor
• Based on a 16×16 patch around the keypoint
• Divided into 4×4 subregions
• 8 orientation bins per subregion
• 4 × 4 × 8 = 128 dimensions in total
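A simplified sketch of this layout, assuming a 16×16 grayscale patch already rotated to the keypoint orientation; full SIFT additionally applies Gaussian weighting, trilinear interpolation, and clamping, which are omitted here:

```python
import numpy as np

def sift_like_descriptor(patch):
    """128-D descriptor sketch: split a 16x16 patch into a 4x4 grid of
    subregions and build an 8-bin gradient-orientation histogram in each
    (4 * 4 * 8 = 128), then normalize for illumination invariance."""
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    desc = []
    for i in range(0, 16, 4):
        for j in range(0, 16, 4):
            h, _ = np.histogram(ang[i:i + 4, j:j + 4], bins=8, range=(0, 360),
                                weights=mag[i:i + 4, j:j + 4])
            desc.append(h)
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

patch = np.random.default_rng(3).random((16, 16))
d = sift_like_descriptor(patch)
```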
PCA-SIFT
• PCA-SIFT is a modification of SIFT that changes how the keypoint descriptors are constructed
• Basic idea: use PCA (Principal Component Analysis) to represent the gradient patch around the keypoint
• PCA stages:
  – Computing the projection matrix
  – Constructing the PCA-SIFT descriptor
Computing projection matrix
• Select a representative set of pictures and detect all keypoints in these pictures
• For each keypoint:
  – Extract a 41×41-pixel image patch around it
  – Calculate horizontal and vertical gradients, resulting in a vector of size 39×39×2 = 3042
• Put all these vectors into a k×3042 matrix A, where k is the number of keypoints detected
• Calculate the covariance matrix of A
Computing projection matrix (contd.)
• Compute the eigenvectors and eigenvalues of cov(A)
• Select the first n eigenvectors; the projection matrix is an n×3042 matrix composed of these eigenvectors
• The projection matrix is computed only once and saved
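The projection-matrix computation can be sketched as follows; to keep it runnable, a toy dimension of 12 stands in for the real 3042-element gradient vectors:

```python
import numpy as np

def pca_projection(A, n):
    """Build the projection matrix from the top-n eigenvectors of the
    covariance of A. Rows of A are keypoint gradient vectors."""
    mean = A.mean(axis=0)
    cov = np.cov(A - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:n]      # indices of the n largest
    return vecs[:, order].T, mean           # n x dim projection matrix

# Toy stand-in: 50 "keypoint gradient vectors" of length 12 (real: 3042).
rng = np.random.default_rng(4)
A = rng.normal(size=(50, 12))
P, mean = pca_projection(A, n=5)
```

The rows of P are orthonormal, so projecting onto them preserves distances within the retained subspace.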
Dimension reduction through PCA
The image patches do not span the entire space of pixel values, nor even the smaller space of patches from natural images: they are a highly restricted set of patches that passed the first three stages of SIFT.
Constructing PCA-SIFT descriptor
• Input: keypoint location, scale, and orientation
• Extract a 41×41 patch around the keypoint at the given scale, rotated to its orientation
• Calculate 39×39 horizontal and vertical gradients, resulting in a vector of size 3042
• Multiply this vector by the precomputed n×3042 projection matrix
• The result is a PCA-SIFT descriptor of size n
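The final projection step can be sketched directly; the QR-based stand-in for the saved projection matrix and the toy dimensions (12 instead of 3042) are illustration-only assumptions:

```python
import numpy as np

def pca_sift_descriptor(grad_vec, P, mean):
    """Project one keypoint's mean-centered gradient vector through the saved
    n x dim projection matrix P; the result is the n-element descriptor."""
    return P @ (grad_vec - mean)

rng = np.random.default_rng(5)
dim, n = 12, 5                    # real PCA-SIFT: dim = 3042, n chosen empirically
Q, _ = np.linalg.qr(rng.normal(size=(dim, n)))  # stand-in orthonormal basis
P = Q.T                            # n x dim, like the saved projection matrix
mean = rng.normal(size=dim)
desc = pca_sift_descriptor(rng.normal(size=dim), P, mean)
```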
Eigenspace construction
Effect of PCA dimension
Hypothesis: the first several components of the PCA subspace are sufficient for encoding variations caused by keypoint identity, while the later components represent details that are not useful, or potentially detrimental, such as distortion from projective warp.
Gradient Location Orientation Histogram
• Another SIFT extension
• Gradients quantized into 16 bins
• Log-polar location grid:
  – 3 bins for radius: 6, 11, 15
  – 8 bins for direction: 0, π/4, π/2, …, 7π/4
GLOH
17 location bins × 16 gradient bins per location bin = 272 elements, reduced to 128 with PCA
GLOH Results
192 correct, 208 false positive… Not as bad as it sounds
DAISY
• An efficient dense local descriptor
• Similar to SIFT and GLOH:
  – The descriptor is fundamentally based on pixel gradient histograms
  – Has key differences that make it much faster for dense matching
• Original application: wide-baseline stereo
• Other applications: face recognition
SIFT:
+ Good performance
+ Better localization
- Not suitable for dense computation

GLOH*:
+ Good performance
- Not suitable for dense computation

* K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. PAMI'04.
DAISY
+ Suitable for dense computation
+ Improved performance*
+ Precise localization
+ Rotational robustness
DAISY
• Parameters:
Computation Steps
• Compute H orientation maps G_i (1 ≤ i ≤ H), one per quantized gradient orientation, where H is the number of histogram bins
• G_o(u,v) = the gradient norm in direction o at pixel (u,v); if the gradient value in direction o is negative, G_o(u,v) = 0
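This first stage can be sketched as follows, assuming a grayscale NumPy image and the H=8 configuration used later in these slides:

```python
import numpy as np

def orientation_maps(img, H=8):
    """DAISY first stage: H orientation maps. G_o(u, v) is the directional
    derivative of the image in direction o, with negative responses clipped
    to zero (so each map is nonnegative)."""
    gy, gx = np.gradient(img.astype(float))
    maps = []
    for i in range(H):
        o = 2 * np.pi * i / H
        g = gx * np.cos(o) + gy * np.sin(o)   # gradient component along o
        maps.append(np.maximum(g, 0.0))       # rectify: keep the positive part
    return maps

img = np.random.default_rng(6).random((16, 16))
maps = orientation_maps(img)
```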
• Each orientation map is then repeatedly convolved with a Gaussian kernel to obtain “convolved” orientation maps.
• There are a total of Q (the number of ‘rings’ in the DAISY) levels of convolution.
• Each pixel now has Q vectors, each of length H, of the form h_Σ(u,v) = [G_1^Σ(u,v), …, G_H^Σ(u,v)]
• Each of these vectors is normalized to unit norm, which helps preserve viewpoint invariance
• Full descriptor size: Q·T·H + H; in this case, 200
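The size arithmetic can be checked directly with the configuration used in these slides:

```python
def daisy_size(Q=3, T=8, H=8):
    """DAISY descriptor length: H histogram values at the center pixel plus
    H values at each of T sample points on each of Q rings."""
    return Q * T * H + H

size = daisy_size()   # 3*8*8 + 8 = 200
```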
Computational Complexity
DAISY configuration: H=8, T=8, Q=3, S=25
  122 multiplications/pixel, 119 summations/pixel, 25 samplings/pixel

SIFT configuration: 16 4x4 arrays, 8 bins
  1280 multiplications/pixel, 512 summations/pixel, 256 samplings/pixel
DAISY vs SIFT
• Computation time: DAISY requires far fewer operations per pixel
• Reasons:
  – DAISY descriptors share histograms
  – The computation pipeline enables an efficient memory access pattern, and histogram layers are separated early
• Easily parallelized
Performance with parallel cores
• Computation time falls almost linearly with the number of cores.
Choosing the Best DAISY
• Winder et al. tested a wide variety of gradient- and steerable-filter-based configurations for calculating image gradients at each pixel
• Found the best configurations for different applications
• Real-time applications:
  – DAISY configuration:
    • 1 or 2 rings
    • 4 bins
  – Rectify image gradients to length one, skip PCA, and quantize histogram values to a bit depth of 2-3
• Applications requiring good discrimination:
  – 2nd-order steerable filters at two spatial scales
  – Application of PCA
• Large-database applications (low storage requirements and computational burden):
  – Steerable filters with H=4 histogram bins, Q=2 rings, T=8 segments
  – Rectified gradients with 4 histogram bins, Q=1 ring, and T=8 segments
Reported Applications of DAISY
• Wide-baseline stereo
• Face recognition
Depth Map Estimation
• DAISY descriptors are used to measure similarities across images
• A graph-cut-based reconstruction algorithm generates the depth maps
• Occlusion masks are used to deal properly with occlusions
Occlusion maps
Depth Map Accuracy
• Ground truth: laser scan
Depth Map Results
Face Recognition
• Dense descriptor computation is necessary for recognizing faces due to the wide-baseline nature of facial images
  – DAISY descriptors are calculated and matched using recursive grid search
  – Match distances are vectorized and input to a Support Vector Machine (SVM)
Recognition Rate compared to previous, similar methods
Olivetti Research Lab Database
FERET Database
FERET Fafb – Varying facial expressions
FERET Fafb – Varying illumination
Local Descriptor Matching
• Methods for matching descriptor vectors:
  – Exhaustive search
  – Recursive grid search
  – KD trees
Recursive Grid Search
• Find the local descriptor for each section of the template image in a grid (DT)
• Find the local descriptor for the corresponding section in the query image (DQ)
• Compute the distance between DT and DQ, as well as to the descriptors of DQ's neighbors at a distance d
• The point showing the minimum distance (DT2) is considered for further analysis
• Descriptor neighbors of DT2, at a distance d/2, are calculated…
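The recursion above can be sketched as follows; the dictionary-based descriptor lookup, toy coordinate descriptors, and stopping rule (halve d until it reaches 0) are illustration-only assumptions:

```python
import numpy as np

def recursive_grid_search(target, query_desc, start, d):
    """Compare the template descriptor ('target') against the query descriptor
    at the current point and its 8 neighbors at distance d, move to the best
    match, halve d, and repeat: a coarse-to-fine local search."""
    best = start
    while d >= 1:
        candidates = [(best[0] + dy, best[1] + dx)
                      for dy in (-d, 0, d) for dx in (-d, 0, d)]
        candidates = [p for p in candidates if p in query_desc]
        best = min(candidates,
                   key=lambda p: np.linalg.norm(query_desc[p] - target))
        d //= 2
    return best

# Toy query image: the descriptor at each grid point is its own coordinates,
# so the search should converge to the point matching the target vector.
query_desc = {(y, x): np.array([y, x], float)
              for y in range(16) for x in range(16)}
found = recursive_grid_search(np.array([9.0, 6.0]), query_desc, (8, 8), d=4)
```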
Recursive Grid Search
KD Trees
• Searches for the nearest neighbor of an n-dimensional point
• A balanced tree is guaranteed to have depth log2(n)
• Has been shown to run in O(log n) average time
• Pre-processing (tree construction) time is O(n log n)
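A minimal k-d tree with nearest-neighbor search can be sketched as follows (median split on cycling axes, recursive search with branch pruning; dictionary nodes are an implementation convenience, not part of the classic formulation):

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Build a k-d tree: split on axis = depth mod k at the median point."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, q, best=None):
    """Recursive nearest-neighbor search: descend toward q, then check the
    far branch only if the splitting plane is closer than the current best."""
    if node is None:
        return best
    p, axis = node["point"], node["axis"]
    if best is None or np.linalg.norm(q - p) < np.linalg.norm(q - best):
        best = p
    near, far = ((node["left"], node["right"]) if q[axis] < p[axis]
                 else (node["right"], node["left"]))
    best = nearest(near, q, best)
    if abs(q[axis] - p[axis]) < np.linalg.norm(q - best):
        best = nearest(far, q, best)   # other side may hold a closer point
    return best

rng = np.random.default_rng(7)
pts = rng.random((100, 2))
tree = build_kdtree(pts)
q = np.array([0.5, 0.5])
nn = nearest(tree, q)
```

The pruning step is what gives the O(log n) average query time quoted above: whole subtrees are skipped when their bounding half-space cannot contain a closer point.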
Performance Evaluation
• Mikolajczyk and Schmid, 2005
• Detecting
• Normalizing
• Describing
• Matching
• Graphing
Detecting
• 10 descriptors will be tested; but first, what will they be tested on?
• Harris points
• Harris-Laplace
• Hessian-Laplace
• Harris-Affine
• Hessian-Affine
Normalizing
• Size: normalized to 41 pixels
• Orientation: dominant gradient
• Illumination: normalize the standard deviation and mean of pixel intensities
Matching
• For histogram based methods, Euclidean distance
• For non-histogram based methods, Mahalanobis distance (S = covariance matrix)
• After distance is calculated, two regions match if D < threshold
• Nearest neighbor threshold
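The two distance measures above can be sketched directly; the toy vectors and covariance matrix are illustration-only:

```python
import numpy as np

def euclidean(a, b):
    """Distance used for histogram-based descriptors."""
    return float(np.linalg.norm(a - b))

def mahalanobis(a, b, S):
    """Distance used for non-histogram descriptors: sqrt(d' S^-1 d),
    where S is the covariance matrix of the descriptor dimensions."""
    d = a - b
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])
S = np.diag([1.0, 4.0])    # larger variance on dim 2 down-weights it
de = euclidean(a, b)       # sqrt(1 + 4)   = sqrt(5)
dm = mahalanobis(a, b, S)  # sqrt(1 + 4/4) = sqrt(2)
```

Two regions are then declared a match when the chosen distance falls below the threshold.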
Data Set
• Original image is subjected to …• Rotations: 30-45 deg around optical axis• Scale: camera zoom 2 – 2.5x• Blur: defocusing• Viewpoint: frontal to foreshortened• Light: aperture varied• Compression: JPEG at 5% quality
Evaluation Criteria
• For each patch pair, compute the distance; is d < t?
• Compare to ground truth
• Count the number of correct and false matches
• To build curves, vary t and repeat; then plot recall vs 1-precision
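The curve construction can be sketched as follows, assuming per-pair distances and ground-truth labels as NumPy arrays (the toy data and thresholds are illustration-only):

```python
import numpy as np

def recall_precision_curve(distances, is_correct, thresholds):
    """For each threshold t, declare a match when d < t, then compute
    recall = true matches / total correct pairs and 1 - precision."""
    total_correct = is_correct.sum()
    curve = []
    for t in thresholds:
        matched = distances < t
        tp = (matched & is_correct).sum()      # correct matches accepted
        fp = (matched & ~is_correct).sum()     # false matches accepted
        recall = tp / total_correct
        one_minus_precision = fp / max(tp + fp, 1)
        curve.append((one_minus_precision, recall))
    return curve

d = np.array([0.1, 0.2, 0.4, 0.7])
correct = np.array([True, True, False, True])
curve = recall_precision_curve(d, correct, thresholds=[0.3, 0.5, 1.0])
```

At t = 1.0 every pair is accepted, so recall reaches 1 while 1-precision reflects the single false match (0.25 here), illustrating why a perfect descriptor would hold recall = 1 at every precision.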
Notes
• If recall = 1 for any precision, we have a perfect descriptor
• Slowly increasing curve => descriptor is affected by the type of noise or transformation we applied to it
• Generally, if the curve for one type of descriptor is higher than the other, it is more robust to that type of transformation
Hessian-Affine detector on Structured Scene
Hessian Laplace Regions
Conclusion
• Feature detection and descriptors through the ages