TRANSCRIPT
SIFT SLAM Vision Details
MIT 16.412J Spring 2004
Vikash K. Mansinghka
1
Outline
• Lightning Summary
• Black Box Model of SIFT SLAM Vision System
• Challenges in Computer Vision
• What these challenges mean for visual SLAM
• How SIFT extracts candidate landmarks
• How landmarks are tracked in SIFT SLAM
• Alternative vision-based SLAM systems
• Open questions
2
Lightning Summary
• Motivation: SLAM without modifying the environment
• Landmark candidates are extracted by the SIFT process
• Candidates matched between cameras to get 3D positions
• Candidates pruned according to consistency w/ robot’s
expectations
• Survivors sent off for statistical processing
3
Review of Robot Specifications
• Triclops 3-camera “stereo” vision system
• Odometry system which produces [p, q, θ]
• Center camera is “reference”
4
Black Box Model of Vision System
• For now, based on black magic (SIFT). Produces landmarks.
• Assume landmarks globally indexed by i.
• Per frame inputs:
– [p, q, θ] - odometry input (x, z, bearing deltas)
– List of (i, xi) - new landmark pos (from SLAM)
• Per frame output is a list of (i, x̂i, xi, ri, ci) for each visible
landmark i where:
– x̂i is its measured 3D pos (w.r.t. camera pos)
– xi is its map 3D pos (w.r.t. initial robot pos), if it isn’t new
– (ri, ci) is its pixel coordinates in center camera
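A minimal sketch of this per-frame record, assuming Python; the field names and example values are illustrative, not taken from the original system:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LandmarkObservation:
    index: int                                      # global landmark index i
    measured_pos: Tuple[float, float, float]        # x̂i: measured 3D pos w.r.t. camera
    map_pos: Optional[Tuple[float, float, float]]   # xi: map 3D pos, None if landmark is new
    pixel: Tuple[int, int]                          # (ri, ci) in the center camera

# One frame of output is simply a list of such records:
frame_output = [
    LandmarkObservation(7, (0.4, 0.1, 2.3), (1.9, 0.1, 5.0), (120, 304)),
    LandmarkObservation(12, (-0.2, 0.0, 3.1), None, (88, 210)),  # newly seen landmark
]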
5
Challenges in Computer Vision
• Intuitively appealing ≠ computationally realizable
• Stable feature extraction is hard; results rarely general
• Extracted features are sparse
• Matching requires exponential time
• Matches are often wrong
6
Implications for Visual SLAM
• Hard to reliably find landmarks
• Really Hard to reliably find landmarks
• Really Really Hard to reliably find landmarks
• Data association is slow and unreliable
• False matches introduce substantial errors
• Accurate probabilistic models unavailable
7
Remarks on SIFT approach
• For visual SLAM, landmarks must be identifiable across:
– Large changes in distance
– Small changes in view direction
– (Bonus) Changes in illumination
• Solution:
– Produce “scale-invariant” image representation
– Extract points with associated scale information
– Use matcher empirically capable of handling small
displacements
8
The Scale-Invariant Feature Transform
• Described in Lowe, IJCV 2004 (preprint; use Google)
• Four stages:
– Scale-space extrema extraction
– Keypoint pruning and localization (not used in SLAM)
– Orientation assignment
– Keypoint descriptor (not used in SLAM)
9
Lightning Introduction to Scale Space
• Motivation:
– Objects can be recognized at many levels of detail
– Large distances correspond to low l.o.d.
– Different kinds of information are available at each level
10
Lightning Introduction to Scale Space
• Idea: Extract information content from an image at each l.o.d.
• Detail reduction typically done by Gaussian blurring
• Long history in both machine and human vision
– Marr in late 1970s
– Henkel in 2000
• Analogous concepts used in speech processing
11
Scale Space in SIFT
• I(x, y) is input image. L(x, y, σ) is rep. at scale σ.
• G(x, y, σ) is 2D Gaussian with variance σ²
• L(x, y, σ) = G(x, y, σ) ∗ I(x, y) (“only” choice; see Koenderink 1984)
• D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)
• D approximates σ²∇²G ∗ I (see Mikolajczyk 2002 for significance)
• D also edge-detector-like; newest SIFT “corrects” for this
• Details of discretization (e.g. resampling, k choice)
unimportant
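A minimal single-octave sketch of this construction, assuming NumPy and SciPy; Lowe's implementation also resamples between octaves, which is omitted here:

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Return the blurred images L(x, y, sigma) and their differences D."""
    image = image.astype(np.float64)
    L = [gaussian_filter(image, sigma0 * k ** i) for i in range(levels)]  # L = G * I
    D = [L[i + 1] - L[i] for i in range(levels - 1)]                      # D = L(k sigma) - L(sigma)
    return L, D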
12
Scale Space in SIFT
• Compute local extrema of D as above
• Each such (x, y, σ) is a feature
• (x, y) part “should” be scale and planar rotation invariant
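A brute-force sketch of the extrema search, assuming the D stack from the previous sketch; a pixel is kept only if it is strictly larger or strictly smaller than all 26 neighbours in its own, the previous, and the next DoG level:

import numpy as np

def dog_extrema(D):
    """D: list of same-sized 2D DoG arrays. Returns (x, y, scale index) triples."""
    features = []
    for s in range(1, len(D) - 1):
        below, here, above = D[s - 1], D[s], D[s + 1]
        for y in range(1, here.shape[0] - 1):
            for x in range(1, here.shape[1] - 1):
                v = here[y, x]
                cube = np.stack([below[y-1:y+2, x-1:x+2],
                                 here[y-1:y+2, x-1:x+2],
                                 above[y-1:y+2, x-1:x+2]])
                # strict extremum: v is the unique max or min of its 3x3x3 neighbourhood
                if (v == cube.max() or v == cube.min()) and np.count_nonzero(cube == v) == 1:
                    features.append((x, y, s))
    return features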
13
SIFT Orientation Assignment
• For each feature (x, y, σ):
– Find fixed-pixel-area patch in L(x, y, σ) around (x, y)
– Compute gradient orientation histogram; call its bins bi
– For each bin bi within 80% of the max, make a feature (x, y, σ, bi)
• Enables matching by including illumination-invariant feature
content (Sinha 2000)
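A sketch of the orientation step, assuming NumPy; the fixed-pixel-area patch is taken as a square of side 2·half+1, and the Gaussian weighting Lowe applies is omitted:

import numpy as np

def assign_orientations(L, x, y, sigma, half=8, nbins=36):
    """L: blurred image at the feature's scale. Returns (x, y, sigma, orientation) tuples."""
    patch = L[y - half:y + half + 1, x - half:x + half + 1]
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, _ = np.histogram(ang, bins=nbins, range=(0.0, 360.0), weights=mag)
    strong = np.nonzero(hist >= 0.8 * hist.max())[0]          # bins within 80% of the peak
    return [(x, y, sigma, (b + 0.5) * 360.0 / nbins) for b in strong]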
14
SIFT Stereopsis
• Apply SIFT to image from each camera.
• Match center feature (x, y, σ, θ) and right feature (x′, y′, σ′, θ′) if:
1. |y − y′| ≤ 1
2. 0 < |x′ − x| ≤ 20
3. |θ − θ′| ≤ 20 degrees
4. 2/3 ≤ σ/σ′ ≤ 3/2
5. No other matches consistent with above exist
• Match similarly for left and top; discard all not matched twice
• Compute 3D positions (trig) as average from horiz. and vert.
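A sketch of conditions 1-4 as a predicate, assuming features are (x, y, sigma, theta) tuples with theta in degrees; the uniqueness requirement in condition 5 has to be enforced across all candidate pairs and is not shown:

def stereo_match(center, right):
    x, y, s, th = center
    xr, yr, sr, thr = right
    return (abs(y - yr) <= 1                          # same scan line (epipolar constraint)
            and 0 < abs(xr - x) <= 20                 # disparity in range
            and abs(th - thr) <= 20                   # orientations agree (ignoring wrap-around)
            and 2.0 / 3.0 <= s / sr <= 3.0 / 2.0)     # scales agree within 3:2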
15
Recapitulation (in G Minor)
• Procedure so far:
1. For each image:
(a) Produce scale-space representation
(b) Find extrema
(c) Compute gradient orientation histograms
2. Match features from center to right and center to top
3. Compute relative 3D positions for survivors
• This gives us potential features from a given frame
• How do we use them?
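The extraction so far can be summarized in a short sketch, where the stage functions stand in for the steps listed above:

def process_frame(images, sift, stereo_match, triangulate):
    """images: dict with 'center', 'right' and 'top' views."""
    feats = {name: sift(img) for name, img in images.items()}
    candidates = []
    for f in feats["center"]:
        r = stereo_match(f, feats["right"])           # center-to-right match, or None
        t = stereo_match(f, feats["top"])             # center-to-top match, or None
        if r is not None and t is not None:           # keep only features matched twice
            candidates.append((f, triangulate(f, r, t)))  # average the two 3D estimates
    return candidates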
17
Landmark Tracking
• Predict where landmarks should appear (reliability, speed)
• Note: Robot moves in xz plane
• Given [p, q, θ] and old relative position [X, Y, Z], find expected position [X′, Y′, Z′] by:
X′ = (X − p) cos(θ) − (Z − q) sin(θ)
Y′ = Y
Z′ = (X − p) sin(θ) + (Z − q) cos(θ)
• By pinhole camera model ((u0, v0) image center coords, I interocular distance, f focal length):
r′ = v0 − f Y′/Z′
c′ = u0 + f X′/Z′
d′ = f I/Z′
σ′ = σ Z/Z′
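A sketch of this prediction step, assuming the quantities named above; the function and argument names are illustrative:

from math import cos, sin

def predict(rel_pos, odom, sigma, f, u0, v0, I):
    X, Y, Z = rel_pos              # old position relative to the camera
    p, q, theta = odom             # x, z and bearing deltas since that view
    Xp = (X - p) * cos(theta) - (Z - q) * sin(theta)
    Yp = Y
    Zp = (X - p) * sin(theta) + (Z - q) * cos(theta)
    r = v0 - f * Yp / Zp           # expected image row
    c = u0 + f * Xp / Zp           # expected image column
    d = f * I / Zp                 # expected stereo disparity
    s = sigma * Z / Zp             # expected SIFT scale
    return (Xp, Yp, Zp), (r, c, d, s)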
19
Landmark Tracking
• V is camera field of view angle (60 degrees)
• A landmark is expected to be in view if:
Z′ > 0
tan⁻¹(|X′|/Z′) < V/2
tan⁻¹(|Y′|/Z′) < V/2
• An expected landmark matches an observed landmark if:
– Obs. center within a 10x10 region around expected
– Obs. scale within 20% of expected
– Obs. orientation within 20 degrees of expected
– Obs. disparity within 20% of expected
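A sketch of both tests, with V in degrees and the thresholds taken from this slide; the layout of the expected and observed tuples is an assumption:

from math import atan2, degrees

def in_view(Xp, Yp, Zp, V=60.0):
    return (Zp > 0
            and degrees(atan2(abs(Xp), Zp)) < V / 2
            and degrees(atan2(abs(Yp), Zp)) < V / 2)

def matches(expected, observed):
    """Tuples are (row, col, disparity, scale, orientation)."""
    er, ec, ed, es, eo = expected
    r, c, d, s, o = observed
    return (abs(r - er) <= 5 and abs(c - ec) <= 5    # within the 10x10 pixel window
            and abs(s - es) <= 0.2 * es              # scale within 20%
            and abs(o - eo) <= 20                    # orientation within 20 degrees
            and abs(d - ed) <= 0.2 * ed)             # disparity within 20%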
20
Landmark Tracking
• A SIFT view is: (SIFT feature, relative 3D pos, absolute view dir)
• Each landmark is: (3D position, list of views, misses)
• Algorithm:
For each frame, find expected landmarks w/ odometry
For each observed view v:
  If v matches an expected landmark l:
    Set l.misses = 0
    Add v to l's view list
  Else: add v to the DB as a new landmark
For each expected but unobserved landmark l:
  If some stored view direction is within 20 degrees of current: l.misses++
  If l.misses >= 20, delete l from the DB
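A minimal sketch of this loop, assuming Python; landmark bookkeeping is reduced to a miss counter and a view list, and the predicate arguments stand in for the prediction and matching tests sketched on the previous slides:

from dataclasses import dataclass, field

@dataclass
class Landmark:
    position: tuple                          # 3D position in the map frame
    views: list = field(default_factory=list)
    misses: int = 0

def update_database(db, observations, expected_in_view, match, near_view_dir, new_landmark):
    expected = [l for l in db if expected_in_view(l)]
    seen = set()
    for obs in observations:
        for l in expected:
            if match(l, obs):
                l.misses = 0                 # reset miss count on a successful match
                l.views.append(obs)
                seen.add(id(l))
                break
        else:
            db.append(new_landmark(obs))     # unmatched view becomes a new landmark
    for l in expected:
        if id(l) not in seen and near_view_dir(l):
            l.misses += 1                    # expected from a familiar direction but not seen
    db[:] = [l for l in db if l.misses < 20] # prune persistently missing landmarks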
22
Other Examples of Vision-based SLAM
• Ceiling lamp based (Panzieri et al. 2003)
• Bearing-only visual SLAM (lots; unclear to me)
• Monocular visual SLAM, e.g. by optical flow (lots; unclear to
me)
• Shi and Tomasi feature based (Davison and Murray 1998)
23
Open Questions
• How to speed vision processing? (Move to hardware?)
• Should full SLAM (not decorrelated SLAM) be used?
• If full SLAM, how can numerics be sped up? (FMM?)
• Can thresholds be automatically tuned? (Maximum
information?)
• Will movement confuse SIFT SLAM? (Fix by optical flow?)
• What environments foil SIFT SLAM?
• How to get dense features? (Geometric model fitting?)
• Can this handle large environments/long trajectories? (Fix by
qualitative navigation?)
• Why does David Lowe rock so hard?
24
Acknowledgements
• Technical content primarily from Lowe 2004 and Se, Lowe,
Little 2002. *
• Various images liberated from their papers and from the
internet.
• Others produced and/or processed by MATLAB, Lowe’s SIFT
demo and ImageMagick.
• My sincerest apologies to Terry Pratchett.
* Se, S., D. Lowe and J. Little, “Mobile Robot Localization and Mapping with Uncertainty using Scale-Invariant Visual Landmarks”, The International Journal of Robotics Research, 21(8), 2002.
25