TRANSCRIPT
SIFT SLAM Vision Details
MIT 16.412J Spring 2004
Vikash K. Mansinghka
1
Outline
• Lightning Summary
• Black Box Model of SIFT SLAM Vision System
• Challenges in Computer Vision
• What these challenges mean for visual SLAM
• How SIFT extracts candidate landmarks
• How landmarks are tracked in SIFT SLAM
• Alternative vision-based SLAM systems
• Open questions
2
Lightning Summary
• Motivation: SLAM without modifying the environment
• Landmark candidates are extracted by the SIFT process
• Candidates matched between cameras to get 3D positions
• Candidates pruned according to consistency w/ robot’s
expectations
• Survivors sent off for statistical processing
3
Review of Robot Specifications
• Triclops 3-camera “stereo” vision system
• Odometry system which produces [p, q, θ]
• Center camera is “reference”
4
Black Box Model of Vision System
• For now, based on black magic (SIFT). Produces landmarks.
• Assume landmarks globally indexed by i.
• Per frame inputs:
– [p, q, θ] - odometry input (x, z, bearing deltas)
– List of (i, xi) - new landmark pos (from SLAM)
• Per frame output is a list of (i, x̂i, xi, ri, ci) for each visible
landmark i where:
– x̂i is its measured 3D pos (w.r.t. camera pos)
– xi is its map 3D pos (w.r.t. initial robot pos), if it isn’t new
– (ri, ci) is its pixel coordinates in center camera
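A minimal sketch of this per-frame record, assuming Python; the field names and example values are illustrative, not taken from the original system:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LandmarkObservation:
    index: int                                      # global landmark index i
    measured_pos: Tuple[float, float, float]        # x̂i: measured 3D pos w.r.t. camera
    map_pos: Optional[Tuple[float, float, float]]   # xi: map 3D pos, None if landmark is new
    pixel: Tuple[int, int]                          # (ri, ci) in the center camera

# One frame of output is simply a list of such records:
frame_output = [
    LandmarkObservation(7, (0.4, 0.1, 2.3), (1.9, 0.1, 5.0), (120, 304)),
    LandmarkObservation(12, (-0.2, 0.0, 3.1), None, (88, 210)),  # newly seen landmark
]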
5
Challenges in Computer Vision
• Intuitively appealing ≠ computationally realizable
• Stable feature extraction is hard; results rarely general
• Extracted features are sparse
• Matching requires exponential time
• Matches are often wrong
6
Implications for Visual SLAM
• Hard to reliably find landmarks
• Really Hard to reliably find landmarks
• Really Really Hard to reliably find landmarks
• Data association is slow and unreliable
• False matches introduce substantial errors
• Accurate probabilistic models unavailable
7
Remarks on SIFT approach
• For visual SLAM, landmarks must be identifiable across:
– Large changes in distance
– Small changes in view direction
– (Bonus) Changes in illumination
• Solution:
– Produce “scale-invariant” image representation
– Extract points with associated scale information
– Use matcher empirically capable of handling small
displacements
8
The Scale-Invariant Feature Transform
• Described in Lowe, IJCV 2004 (preprint; use Google)
• Four stages:
– Scale-space extrema extraction
– Keypoint pruning and localization (not used in SLAM)
– Orientation assignment
– Keypoint descriptor (not used in SLAM)
9
Lightning Introduction to Scale Space
• Motivation:
– Objects can be recognized at many levels of detail
– Large distances correspond to low l.o.d.
– Different kinds of information are available at each level
10
Lightning Introduction to Scale Space
• Idea: Extract information content from an image at each l.o.d.
• Detail reduction typically done by Gaussian blurring
• Long history in both machine and human vision
– Marr in late 1970s
– Henkel in 2000
• Analogous concepts used in speech processing
11
Scale Space in SIFT
• I(x, y) is input image. L(x, y, σ) is rep. at scale σ.
• G(x, y, σ) is 2D Gaussian with variance σ²
• L(x, y, σ) = G(x, y, σ) ∗ I(x, y) (“only” choice; see Koenderink 1984)
• D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)
• D approximates σ²∇²G ∗ I (see Mikolajczyk 2002 for significance)
• D also edge-detector-like; newest SIFT “corrects” for this
• Details of discretization (e.g. resampling, k choice)
unimportant
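A minimal single-octave sketch of this construction, assuming NumPy and SciPy; Lowe's implementation also resamples between octaves, which is omitted here:

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Return the blurred images L(x, y, sigma) and their differences D."""
    image = image.astype(np.float64)
    L = [gaussian_filter(image, sigma0 * k ** i) for i in range(levels)]  # L = G * I
    D = [L[i + 1] - L[i] for i in range(levels - 1)]                      # D = L(k sigma) - L(sigma)
    return L, D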
12
Scale Space in SIFT
• Compute local extrema of D as above
• Each such (x, y, σ) is a feature
• (x, y) part “should” be scale and planar rotation invariant
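A brute-force sketch of the extrema search, assuming the D stack from the previous sketch; a pixel is kept only if it is strictly larger or strictly smaller than all 26 neighbours in its own, the previous, and the next DoG level:

import numpy as np

def dog_extrema(D):
    """D: list of same-sized 2D DoG arrays. Returns (x, y, scale index) triples."""
    features = []
    for s in range(1, len(D) - 1):
        below, here, above = D[s - 1], D[s], D[s + 1]
        for y in range(1, here.shape[0] - 1):
            for x in range(1, here.shape[1] - 1):
                v = here[y, x]
                cube = np.stack([below[y-1:y+2, x-1:x+2],
                                 here[y-1:y+2, x-1:x+2],
                                 above[y-1:y+2, x-1:x+2]])
                # strict extremum: v is the unique max or min of its 3x3x3 neighbourhood
                if (v == cube.max() or v == cube.min()) and np.count_nonzero(cube == v) == 1:
                    features.append((x, y, s))
    return features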
13
SIFT Orientation Assignment
• For each feature (x, y, σ):
– Find fixed-pixel-area patch in L(x, y, σ) around (x, y)
– Compute gradient orientation histogram; call its bins bi
– For each bin bi within 80% of the max, make a feature (x, y, σ, bi)
• Enables matching by including illumination-invariant feature
content (Sinha 2000)
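A sketch of the orientation step, assuming NumPy; the fixed-pixel-area patch is taken as a square of side 2·half+1, and the Gaussian weighting Lowe applies is omitted:

import numpy as np

def assign_orientations(L, x, y, sigma, half=8, nbins=36):
    """L: blurred image at the feature's scale. Returns (x, y, sigma, orientation) tuples."""
    patch = L[y - half:y + half + 1, x - half:x + half + 1]
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, _ = np.histogram(ang, bins=nbins, range=(0.0, 360.0), weights=mag)
    strong = np.nonzero(hist >= 0.8 * hist.max())[0]          # bins within 80% of the peak
    return [(x, y, sigma, (b + 0.5) * 360.0 / nbins) for b in strong]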
14
SIFT Stereopsis
• Apply SIFT to image from each camera.
• Match center feature (x, y, σ, θ) and right feature (x′, y′, σ′, θ′) if:
1. |y − y′| ≤ 1
2. 0 < |x′ − x| ≤ 20
3. |θ − θ′| ≤ 20 degrees
4. 2/3 ≤ σ/σ′ ≤ 3/2
5. No other matches consistent with above exist
• Match similarly for left and top; discard all not matched twice
• Compute 3D positions (trig) as average from horiz. and vert.
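A sketch of conditions 1-4 as a predicate, assuming features are (x, y, sigma, theta) tuples with theta in degrees; the uniqueness requirement in condition 5 has to be enforced across all candidate pairs and is not shown:

def stereo_match(center, right):
    x, y, s, th = center
    xr, yr, sr, thr = right
    return (abs(y - yr) <= 1                          # same scan line (epipolar constraint)
            and 0 < abs(xr - x) <= 20                 # disparity in range
            and abs(th - thr) <= 20                   # orientations agree (ignoring wrap-around)
            and 2.0 / 3.0 <= s / sr <= 3.0 / 2.0)     # scales agree within 3:2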
15
Recapitulation (in G Minor)
• Procedure so far:
1. For each image:
(a) Produce scale-space representation
(b) Find extrema
(c) Compute gradient orientation histograms
2. Match features from center to right and center to top
3. Compute relative 3D positions for survivors
• This gives us potential features from a given frame
• How do we use them?
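The extraction so far can be summarized in a short sketch, where the stage functions stand in for the steps listed above:

def process_frame(images, sift, stereo_match, triangulate):
    """images: dict with 'center', 'right' and 'top' views."""
    feats = {name: sift(img) for name, img in images.items()}
    candidates = []
    for f in feats["center"]:
        r = stereo_match(f, feats["right"])           # center-to-right match, or None
        t = stereo_match(f, feats["top"])             # center-to-top match, or None
        if r is not None and t is not None:           # keep only features matched twice
            candidates.append((f, triangulate(f, r, t)))  # average the two 3D estimates
    return candidates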
17
Landmark Tracking
• Predict where landmarks should appear (reliability, speed)
• Note: Robot moves in xz plane
• Given [p, q, θ] and old relative position [X, Y, Z], find expected position [X′, Y′, Z′] by:
X′ = (X − p) cos(θ) − (Z − q) sin(θ)
Y′ = Y
Z′ = (X − p) sin(θ) + (Z − q) cos(θ)
• By pinhole camera model ((u0, v0) image center coords, I interocular distance, f focal length):
r′ = v0 − f Y′/Z′
c′ = u0 + f X′/Z′
d′ = f I/Z′
σ′ = σ Z/Z′
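A sketch of this prediction step, assuming the quantities named above; the function and argument names are illustrative:

from math import cos, sin

def predict(rel_pos, odom, sigma, f, u0, v0, I):
    X, Y, Z = rel_pos              # old position relative to the camera
    p, q, theta = odom             # x, z and bearing deltas since that view
    Xp = (X - p) * cos(theta) - (Z - q) * sin(theta)
    Yp = Y
    Zp = (X - p) * sin(theta) + (Z - q) * cos(theta)
    r = v0 - f * Yp / Zp           # expected image row
    c = u0 + f * Xp / Zp           # expected image column
    d = f * I / Zp                 # expected stereo disparity
    s = sigma * Z / Zp             # expected SIFT scale
    return (Xp, Yp, Zp), (r, c, d, s)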
19
Landmark Tracking
• V is camera field of view angle (60 degrees)
• A landmark is expected to be in view if:
Z′ > 0
tan⁻¹(|X′|/Z′) < V/2
tan⁻¹(|Y′|/Z′) < V/2
• An expected landmark matches an observed landmark if:
– Obs. center within a 10x10 region around expected
– Obs. scale within 20% of expected
– Obs. orientation within 20 degrees of expected
– Obs. disparity within 20% of expected
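A sketch of both tests, with V in degrees and the thresholds taken from this slide; the layout of the expected and observed tuples is an assumption:

from math import atan2, degrees

def in_view(Xp, Yp, Zp, V=60.0):
    return (Zp > 0
            and degrees(atan2(abs(Xp), Zp)) < V / 2
            and degrees(atan2(abs(Yp), Zp)) < V / 2)

def matches(expected, observed):
    """Tuples are (row, col, disparity, scale, orientation)."""
    er, ec, ed, es, eo = expected
    r, c, d, s, o = observed
    return (abs(r - er) <= 5 and abs(c - ec) <= 5    # within the 10x10 pixel window
            and abs(s - es) <= 0.2 * es              # scale within 20%
            and abs(o - eo) <= 20                    # orientation within 20 degrees
            and abs(d - ed) <= 0.2 * ed)             # disparity within 20%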
20
Landmark Tracking
• A SIFT view is: (SIFT feature, relative 3D pos, absolute view dir)
• Each landmark is: (3D position, list of views, misses)
• Algorithm:
For each frame, find expected landmarks w/ odometry
For each observed view v:
  If v matches an expected landmark l:
    Set l.misses = 0
    Add v to l's view list
  Else: add v to the DB as a new landmark
For each expected but unobserved landmark l:
  If some stored view direction is within 20 degrees of current: l.misses++
  If l.misses >= 20, delete l from the DB
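A minimal sketch of this loop, assuming Python; landmark bookkeeping is reduced to a miss counter and a view list, and the predicate arguments stand in for the prediction and matching tests sketched on the previous slides:

from dataclasses import dataclass, field

@dataclass
class Landmark:
    position: tuple                          # 3D position in the map frame
    views: list = field(default_factory=list)
    misses: int = 0

def update_database(db, observations, expected_in_view, match, near_view_dir, new_landmark):
    expected = [l for l in db if expected_in_view(l)]
    seen = set()
    for obs in observations:
        for l in expected:
            if match(l, obs):
                l.misses = 0                 # reset miss count on a successful match
                l.views.append(obs)
                seen.add(id(l))
                break
        else:
            db.append(new_landmark(obs))     # unmatched view becomes a new landmark
    for l in expected:
        if id(l) not in seen and near_view_dir(l):
            l.misses += 1                    # expected from a familiar direction but not seen
    db[:] = [l for l in db if l.misses < 20] # prune persistently missing landmarks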
22
Other Examples of Vision-based SLAM
• Ceiling lamp based (Panzieri et al. 2003)
• Bearing-only visual SLAM (lots; unclear to me)
• Monocular visual SLAM, e.g. by optical flow (lots; unclear to
me)
• Shi and Tomasi feature based (Davison and Murray 1998)
23
Open Questions
• How to speed vision processing? (Move to hardware?)
• Should full SLAM (not decorrelated SLAM) be used?
• If full SLAM, how can numerics be sped up? (FMM?)
• Can thresholds be automatically tuned? (Maximum
information?)
• Will movement confuse SIFT SLAM? (Fix by optical flow?)
• What environments foil SIFT SLAM?
• How to get dense features? (Geometric model fitting?)
• Can this handle large environments/long trajectories? (Fix by
qualitative navigation?)
• Why does David Lowe rock so hard?
24
Acknowledgements
• Technical content primarily from Lowe 2004 and Se, Lowe,
Little 2002. *
• Various images liberated from their papers and from the
internet.
• Others produced and/or processed by MATLAB, Lowe’s SIFT
demo and ImageMagick.
• My sincerest apologies to Terry Pratchett.
* Se, S., D. Lowe and J. Little, “Mobile Robot Localization and Mapping with Uncertainty using Scale-Invariant Visual Landmarks”, The International Journal of Robotics Research, 21(8), 2002.
25