TRANSCRIPT
Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences*†
Mohammed E. Fathy1, Quoc-Huy Tran2, M. Zeeshan Zia3, Paul Vernaza2, and Manmohan Chandraker2,4
1Google Cloud AI 2NEC Laboratories America, Inc. 3Microsoft 4University of California, San Diego
Problems & Contributions
Semantic features + localization: robust to appearance variations.
Geometric features + localization: sensitive to local structure.
Hierarchical Metric Learning and Matching (HiLM) combines both.
Our hierarchical metric learning and matching (HiLM) retains the best properties of various levels of abstraction in CNN feature representations. Corresponding pixels are overlaid on the input images with the same colors.
Problem: Given two RGB images capturing a common scene from different viewpoints, our task is to estimate 2D geometric correspondences between pixels in the two images. In addition, we propose an extension for computing 3D geometric correspondences from two point clouds.
Contributions:
• We derive useful insights for designing CNN-based correspondence methods that exploit both deep and shallow features.
• Hierarchical metric losses at multiple CNN layers for correspondence learning.
• Hierarchical matching within CNN activation maps for refining correspondences.
• Experimental results show significant improvements over previous single-layer and feature fusion approaches.
• The improvements further translate to different data modalities (2D images, 3D point clouds) and generalize across various datasets (Sintel, HPatches, KITTI).
2D CNN Architecture for 2D Correspondence Estimation
[Figure: 2D CNN architecture (Siamese, shared weights between branches). Each branch: Conv-1: 96 7x7/S2 → ReLU/LRN/Max-Pool: 3x3/S2 → Conv-2: 256 5x5/S1 → ReLU/LRN/Max-Pool: 5x5/S1 → Conv-3: 512 3x3/D4 → Conv-4: 512 3x3/D4 → Conv-5: 512 3x3/D4. Training: an intermediate feature level passes through FE-Conv: 128 3x3 and L2-Normalize, and the top level through FE-Conv: 128 1x1 and L2-Normalize; each supervised level has its own hard-negative mining and CCL loss. Testing: coarse matching on deep features followed by constrained matching on shallow features produces precise matches.]
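The two-stage matching above (coarse matching on deep features, then constrained matching on shallow features) can be sketched as follows. The feature shapes, the stride relating the two resolutions, the search radius, and the function name are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def coarse_to_fine_match(fine1, fine2, coarse1, coarse2, stride=4, radius=2):
    """Hierarchical matching sketch: match on coarse (deep) features first,
    then refine each match by searching fine (shallow) features only within
    a small window around the coarse estimate. Feature maps are (H, W, C)
    and L2-normalized along the channel axis."""
    Hc, Wc, _ = coarse1.shape
    Hf, Wf, _ = fine1.shape
    flat2 = coarse2.reshape(-1, coarse2.shape[-1])
    matches = []
    for i in range(Hc):
        for j in range(Wc):
            # Coarse matching: nearest neighbor by cosine similarity.
            sims = flat2 @ coarse1[i, j]
            ci, cj = np.unravel_index(np.argmax(sims), (Hc, Wc))
            # Constrained matching: refine within a window at fine resolution.
            yi, xj = ci * stride, cj * stride
            y0, y1 = max(0, yi - radius), min(Hf, yi + radius + 1)
            x0, x1 = max(0, xj - radius), min(Wf, xj + radius + 1)
            win = fine2[y0:y1, x0:x1].reshape(-1, fine2.shape[-1])
            q = fine1[min(i * stride, Hf - 1), min(j * stride, Wf - 1)]
            k = int(np.argmax(win @ q))
            wy, wx = np.unravel_index(k, (y1 - y0, x1 - x0))
            matches.append(((i, j), (y0 + wy, x0 + wx)))
    return matches
```

Restricting the fine-level search to a window around the coarse estimate is what keeps refinement cheap relative to exhaustive nearest-neighbor search at full resolution.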
Deep supervision improves generalization by encouraging the CNN to learn task-relevant features early on. Moreover, both deep and shallow layers can be supervised simultaneously within one CNN.
Hard Negative Mining happens "on-the-fly" to speed up training and leverage the latest instance of network weights. Hard negative mining is employed independently for each of the feature levels being supervised.
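On-the-fly mining of this kind might look like the following sketch (the function and its parameters are ours, not the paper's exact implementation): for each anchor, the nearest non-matching descriptor under the current weights is selected, and only margin violators are kept.

```python
import numpy as np

def mine_hard_negatives(desc1, desc2, pos_idx, margin=1.0):
    """Hard negative mining sketch. For each anchor descriptor desc1[i]
    with ground-truth match pos_idx[i] in desc2, pick the non-matching
    descriptor in desc2 that is closest under the current embeddings,
    keeping it only if it violates the margin. Descriptors are row
    vectors, one per pixel."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)  # (N1, N2)
    neg_idx = []
    for i, p in enumerate(pos_idx):
        row = d[i].copy()
        row[p] = np.inf            # exclude the true match
        j = int(np.argmin(row))    # hardest negative = nearest non-match
        if row[j] < margin:        # only margin violators contribute loss
            neg_idx.append((i, j))
    return neg_idx
```

Because mining uses the latest network weights, the set of hard negatives changes as training progresses, which is the point of doing it on-the-fly.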
Correspondence Contrastive Loss (CCL):

L := \sum_{l=1}^{L} \sum_{(x, x', y) \in D} \left[ y \cdot d_l^2(x, x') + (1 - y) \cdot \big( \max(0, m - d_l(x, x')) \big)^2 \right]

where d_l(x, x') is the distance between the feature descriptors of pixels x and x' at level l, y = 1 for a true correspondence (0 otherwise), and m is the margin.
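A minimal NumPy rendering of the per-level CCL term and its hierarchical sum (the function names are ours):

```python
import numpy as np

def ccl_loss(d, y, margin=1.0):
    """Correspondence Contrastive Loss for one feature level l (sketch).
    d: descriptor distances d_l(x, x') for a batch of training pairs,
    y: 1 for a true correspondence, 0 for a negative pair.
    Positives are pulled together (d^2); negatives are pushed past the margin."""
    pos = y * d ** 2
    neg = (1 - y) * np.maximum(0.0, margin - d) ** 2
    return float(np.sum(pos + neg))

def hierarchical_ccl(dists_per_level, y, margin=1.0):
    """The hierarchical loss is simply the sum of per-level CCL terms."""
    return sum(ccl_loss(d, y, margin) for d in dists_per_level)
```

Negatives farther apart than the margin contribute zero loss, so only hard (margin-violating) negatives drive the gradient, consistent with the hard negative mining described above.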
*Part of this work was done during M. E. Fathy's internship at NEC Labs America in Cupertino, CA. The authors thank C. B. Choy and A. Zeng for their help with the code of UCN and 3DMatch, respectively.
Baseline Architectures Inspired by Previous Works
[Figure: baseline architectures. (a) convi-net (i = 3) [1]: a single branch Conv-1: 96 7x7/S2 → ReLU/LRN/Max-Pool: 3x3/S2 → Conv-2: 256 5x5/S1 → ReLU/LRN/Max-Pool: 5x5/S1 → Conv-3: 512 3x3/D4 → L2-Normalize → CCL loss. (b) hypercolumn-fusion [2]: activations from Conv-1: 96 7x7/S2 (via Max-Pool: 1x1/S2) through Conv-5: 512 3x3/D4 are concatenated, then FE-Conv: 128 1x1 → L2-Normalize → CCL loss. (c) topdown-fusion [3]: a bottom-up pass (Conv-1: 96 7x7/S2 … Conv-4: 512 3x3/D2/S2 → Conv-5: 512 3x3) is refined top-down by modules (Conv: 512 1x1/ReLU → Conv: 512/256 3x3) that upsample x2 and concatenate with shallower features, ending in FE-Conv: 128 1x1 → L2-Normalize → CCL loss.]
2D Correspondence Results
[Figure: accuracy (PCK, %) vs. pixel threshold (1-10 and 10-100), comparing conv1-net through conv5-net, hypercolumn-fusion, topdown-fusion, and HiLM variants (conv2+conv3, conv2+conv4, conv2+conv5, and conv2+conv5 trained on Sintel). Caption: Comparison of different CNN-based methods for 2D correspondence estimation on KITTI Flow 2015.]
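PCK, the accuracy metric used throughout these plots, is the fraction of predicted correspondences that fall within a given pixel threshold of the ground truth; a minimal sketch:

```python
import numpy as np

def pck(pred, gt, thresholds):
    """Percentage of Correct Keypoints (PCK): for each threshold t, the
    percentage of predictions whose Euclidean error to ground truth is
    at most t. pred, gt: (N, 2) arrays of 2D pixel locations."""
    err = np.linalg.norm(pred - gt, axis=1)
    return [float(np.mean(err <= t) * 100.0) for t in thresholds]
```

Sweeping the threshold from 1 to 100 pixels, as in the plots, traces out the full accuracy curve for each method.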
[Figure: accuracy (PCK, %) vs. pixel threshold (1-10 and 10-100), comparing hand-crafted descriptors (SIFT, DAISY, KAZE) with CNN-based UCN, HiLM (VGG-M), and HiLM (GoogLeNet). Caption: Comparison of CNN-based and hand-crafted methods for 2D correspondence estimation on KITTI Flow 2015.]
[Figure: accuracy (PCK, %) vs. pixel threshold (1-10 and 10-100) for hypercolumn-fusion, topdown-fusion, and HiLM variants (conv2+conv3, conv2+conv4, conv2+conv5) trained on KITTI and on MPI Sintel. Caption: Generalization results when training on MPI Sintel and evaluating on KITTI Flow 2015.]
†Code and models will be made available at http://www.nec-labs.com/~mas/HiLM/.
Optical Flow Results
Method          Fl-bg    Fl-fg    Fl-all
FlowNet2        10.75%    8.75%   10.41%
SDF              8.61%   26.69%   11.62%
SOF             14.63%   27.73%   16.81%
CNN-HPM         18.33%   24.96%   19.44%
HiLM (Ours)     23.73%   21.79%   23.41%
SPM-BP          24.06%   24.97%   24.21%
FullFlow        23.09%   30.11%   24.26%
AutoScaler [4]  21.85%   31.62%   25.64%
EpicFlow        25.81%   33.56%   27.10%
DeepFlow2       27.96%   35.28%   29.18%
PatchCollider   30.60%   33.09%   31.01%

Quantitative results on KITTI Flow 2015. AutoScaler [4] is an image-pyramid/multi-scale 2D correspondence estimation method. Bold represents the best result; underlined depicts the second best.
Qualitative results on KITTI Flow 2015. 1st row: input images. 2nd row: DeepFlow2. 3rd row: EpicFlow. 4th row: SPM-BP. 5th row: HiLM (Ours). Red depicts high error, blue represents low error.
3D CNN Architecture for 3D Correspondence Estimation
[Figure: 3D CNN architecture (Siamese, shared weights between branches). Each branch takes a 30^3 local patch encoded as a TDF voxel grid and applies Conv-1: 64, 3^3, S1 → Pool-1: 64, 2^3, S2 → Conv-2: 64, 3^3, S1 → Conv-3: 128, 3^3, S1 → Conv-4: 128, 3^3, S1 → Conv-5: 256, 3^3, S1 → Conv-6: 256, 3^3, S1 → Conv-7: 512, 3^3, S1 → Conv-8: 512, 3^3, S1. Training: an intermediate level (via Conv-2a: 128, 3^3, S2 → Conv-2b: 256, 3^3, S2 → FE: 512, 5^3, S1) and the top level (via FE: 512, 1^3, S1) are each supervised with a CCL loss. Testing: coarse matching followed by constrained matching yields precise matches.]
For a fair comparison with 3DMatch [5], a CNN-based approach to 3D correspondence estimation, we disable hard negative mining in our 3D CNN architecture.
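As a rough illustration of the TDF (Truncated Distance Function) voxel-grid input used by the 3D network, here is a simplified sketch; the exact truncation and normalization used by 3DMatch may differ, and all parameter values here are assumptions:

```python
import numpy as np

def tdf_voxel_grid(points, center, voxel_size=0.01, grid=30, trunc=5):
    """Simplified TDF voxel grid around a 3D interest point. Each voxel
    stores 1 - min(d / trunc_dist, 1), where d is the distance from the
    voxel center to the nearest surface point: the surface maps to values
    near 1 and free space decays to 0. Brute-force nearest-point search,
    for illustration only."""
    half = grid // 2
    idx = np.arange(grid) - half + 0.5
    gx, gy, gz = np.meshgrid(idx, idx, idx, indexing="ij")
    centers = np.stack([gx, gy, gz], axis=-1) * voxel_size + center  # (G, G, G, 3)
    d = np.min(
        np.linalg.norm(centers[..., None, :] - points[None, None, None, :, :], axis=-1),
        axis=-1,
    )
    trunc_dist = trunc * voxel_size
    return 1.0 - np.minimum(d / trunc_dist, 1.0)
```

The truncation makes the representation local: geometry farther than trunc_dist from any voxel contributes nothing, which suits patch-based matching.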
3D Correspondence Results
[Figure: accuracy (PCK, %) vs. threshold (cm), comparing 3DMatch, HiL (conv2), HiL (conv8), and HiLM (conv2+conv8). Caption: Comparison of different CNN-based methods for 3D correspondence estimation.]
References
[1] Choy et al.: Universal Correspondence Network. In: NIPS (2016)
[2] Hariharan et al.: Hypercolumns for Object Segmentation and Fine-grained Localization. In: CVPR (2015)
[3] Pinheiro et al.: Learning to Refine Object Segments. In: ECCV (2016)
[4] Wang et al.: AutoScaler: Scale-Attention Networks for Visual Correspondence. In: BMVC (2017)
[5] Zeng et al.: 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. In: CVPR (2017)