
Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences*†

Mohammed E. Fathy^1, Quoc-Huy Tran^2, M. Zeeshan Zia^3, Paul Vernaza^2, and Manmohan Chandraker^2,4

^1 Google Cloud AI   ^2 NEC Laboratories America, Inc.   ^3 Microsoft   ^4 University of California, San Diego

Problems & Contributions

[Teaser figure] Semantic features: localization robust to appearance variations. Geometric features: localization sensitive to local structure. Hierarchical Metric Learning and Matching (HiLM) combines the two.

Our hierarchical metric learning and matching (HiLM) retains the best properties of various levels of abstraction in CNN feature representations. Corresponding pixels are overlaid on the input images with the same colors.

Problems: Given two RGB images capturing a common scene from different viewpoints, our task is to estimate 2D geometric correspondences between the pixels in the images. In addition, we propose an extension to computing 3D geometric correspondences from two point clouds.

Contributions:
• We derive useful insights for designing CNN-based correspondence methods, where both deep and shallow features are exploited.
• Hierarchical metric losses at multiple CNN layers for correspondence learning.
• Hierarchical matching within CNN activation maps for refining correspondences.
• Experimental results show significant improvements over previous single-layer and feature fusion approaches.
• The improvements further translate to different data modalities (2D images, 3D point clouds) and generalize across various datasets (Sintel, HPatches, KITTI).

2D CNN Architecture for 2D Correspondence Estimation

[2D architecture diagram] Siamese towers: Conv-1: 96 7x7/S2 → ReLU/LRN/Max-Pool: 3x3/S2 → Conv-2: 256 5x5/S1 → ReLU/LRN/Max-Pool: 5x5/S1 → Conv-3: 512 3x3/D4 → Conv-4: 512 3x3/D4 → Conv-5: 512 3x3/D4. Two feature-extraction heads per tower, one at a shallow layer (FE-Conv: 128 3x3) and one at a deep layer (FE-Conv: 128 1x1), each followed by L2-Normalize, on-the-fly hard-negative mining, and a CCL loss during training. At test time, coarse matching on the deep features is refined by constrained matching on the shallow features to produce precise matches.
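The test-time hierarchical matching can be illustrated with a minimal sketch: coarse nearest-neighbor matching on the deeper (coarser) features proposes a candidate location, and constrained matching then refines it within a local window of the shallower (finer) features. The NumPy code below is an assumed illustration, not the released implementation; the resolution ratio `scale`, the window `radius`, and the assumption that the fine map is indexed at query-pixel resolution are all hypothetical.

```python
import numpy as np

def hierarchical_match(deep_a, deep_b, fine_a, fine_b, y, x, scale=4, radius=8):
    """Coarse-to-fine matching sketch for one query pixel (y, x) in image A.

    deep_a, deep_b: [Hd, Wd, C] L2-normalized deep (coarse) feature maps.
    fine_a, fine_b: [Hf, Wf, C] L2-normalized shallow (fine) feature maps.
    scale:          assumed resolution ratio between the fine and deep maps.
    radius:         half-size of the constrained search window (fine level).
    """
    # 1. Coarse matching: nearest neighbor of the deep descriptor over all of image B.
    q = deep_a[y // scale, x // scale]
    sim = np.tensordot(deep_b, q, axes=([2], [0]))        # cosine similarity (unit-norm features)
    cy, cx = np.unravel_index(np.argmax(sim), sim.shape)  # coarse match location

    # 2. Constrained matching: refine with the shallow features inside a window
    #    around the coarse match.
    fy, fx = cy * scale, cx * scale
    y0, y1 = max(0, fy - radius), min(fine_b.shape[0], fy + radius + 1)
    x0, x1 = max(0, fx - radius), min(fine_b.shape[1], fx + radius + 1)
    qf = fine_a[y, x]
    sim_f = np.tensordot(fine_b[y0:y1, x0:x1], qf, axes=([2], [0]))
    wy, wx = np.unravel_index(np.argmax(sim_f), sim_f.shape)
    return y0 + wy, x0 + wx                               # precise match in image B
```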

Deep supervision improves generalization by encouraging the CNN early on to learn task-relevant features. Also, both deep and shallow layers can be supervised simultaneously within one CNN.

Hard Negative Mining happens "on-the-fly" to speed up training and leverage the latest instance of network weights. Hard negative mining is employed independently for each of the feature levels being supervised.
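As a rough, assumed illustration of what per-level "on-the-fly" hard negative mining could look like (not the authors' exact procedure): for each annotated positive pair, the descriptor in the other image that is most similar under the current network weights, yet sufficiently far from the true match, is taken as the negative. The names and the `min_pixel_dist` heuristic below are hypothetical.

```python
import torch

def mine_hard_negatives(feat_a, feat_b, coords_b, min_pixel_dist=16.0):
    """On-the-fly hard negative mining sketch for one feature level (PyTorch assumed).

    feat_a, feat_b:  [N, C] L2-normalized descriptors of N corresponding pairs
                     (row i of feat_a matches row i of feat_b).
    coords_b:        [N, 2] pixel coordinates of the descriptors in image B.
    min_pixel_dist:  negatives must lie at least this far from the true match.
    Returns, for each row of feat_a, the index of its hardest negative in feat_b.
    """
    sim = feat_a @ feat_b.t()                                        # [N, N] cosine similarities
    pix_dist = torch.cdist(coords_b.float(), coords_b.float())       # [N, N] pixel distances
    sim = sim.masked_fill(pix_dist < min_pixel_dist, float('-inf'))  # exclude near-matches (incl. the positive)
    return sim.argmax(dim=1)                                         # most similar remaining descriptor
```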

Correspondence Contrastive Loss (CCL):

$$
\mathcal{L} := \sum_{l=1}^{L} \; \sum_{(x,\, x',\, y) \in \mathcal{D}} y \cdot d_l^2(x, x') \;+\; (1 - y) \cdot \big( \max(0,\; m - d_l(x, x')) \big)^2
$$
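A minimal PyTorch sketch of this hierarchical CCL, assuming the per-level descriptors have already been sampled at the annotated pixel pairs (names and shapes are illustrative, not the authors' implementation):

```python
import torch

def hierarchical_ccl(feats_a, feats_b, y, m=1.0):
    """Correspondence Contrastive Loss summed over the supervised levels l = 1..L.

    feats_a, feats_b: lists of [N, C_l] L2-normalized descriptors, one tensor per
                      supervised level (e.g. the shallow and deep FE-Conv outputs).
    y:                [N] float tensor, 1 for matching pairs, 0 for non-matching pairs.
    m:                contrastive margin.
    """
    loss = feats_a[0].new_zeros(())
    for fa, fb in zip(feats_a, feats_b):                  # sum over levels l
        d = (fa - fb).norm(dim=1)                         # d_l(x, x')
        loss = loss + (y * d.pow(2)                       # pull matching pairs together
                       + (1.0 - y) * torch.clamp(m - d, min=0).pow(2)).sum()  # push others apart
    return loss
```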

*Part of this work was done during M. E. Fathy's internship at NEC Labs America in Cupertino, CA. The authors thank C. B. Choy and A. Zeng for their help with the code of UCN and 3DMatch, respectively.

Baseline Architectures Inspired by Previous Works

[Baseline architecture diagrams]
(a) convi-net (i = 3) [1]: Conv-1: 96 7x7/S2 → ReLU/LRN/Max-Pool: 3x3/S2 → Conv-2: 256 5x5/S1 → ReLU/LRN/Max-Pool: 5x5/S1 → Conv-3: 512 3x3/D4 → L2-Normalize → CCL Loss.
(b) hypercolumn-fusion [2]: the full Conv-1 through Conv-5 stack; feature maps from several layers are pooled (Max-Pool: 1x1/S2) to a common resolution and concatenated → FE-Conv: 128 1x1 → L2-Normalize → CCL Loss.
(c) topdown-fusion [3]: a Conv-1 through Conv-5 stack (with D2/S2 variants of Conv-3 to Conv-5); deep features are refined top-down (Conv: 512 1x1/ReLU, Conv: 512 3x3, Conv: 256 3x3), upsampled x2, and concatenated with shallower features → FE-Conv: 128 1x1 → L2-Normalize → CCL Loss.

2D Correspondence Results

[PCK plots, thresholds 1-10 px and 10-100 px] Methods compared: conv1-net through conv5-net, hypercolumn-fusion, topdown-fusion, HiLM (conv2+conv3), HiLM (conv2+conv4), HiLM (conv2+conv5), and HiLM (conv2+conv5) trained on Sintel. Comparison of different CNN-based methods for 2D correspondence estimation on KITTI Flow 2015.
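PCK here denotes the percentage of predicted correspondences falling within a given pixel threshold of the ground truth; a small sketch of how such a curve might be computed (assumed, not the authors' evaluation code):

```python
import numpy as np

def pck_curve(pred, gt, thresholds=range(1, 11)):
    """PCK: percentage of predicted matches within each pixel threshold.

    pred, gt:   [N, 2] arrays of predicted and ground-truth pixel coordinates.
    thresholds: pixel thresholds at which to evaluate (e.g. 1..10 as in the plots).
    """
    err = np.linalg.norm(pred - gt, axis=1)               # per-match Euclidean error
    return [100.0 * float((err <= t).mean()) for t in thresholds]
```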

[PCK plots, thresholds 1-10 px and 10-100 px] Methods compared: SIFT, DAISY, KAZE, UCN, HiLM (VGG-M), and HiLM (GoogLeNet). Comparison of CNN-based and hand-crafted methods for 2D correspondence estimation on KITTI Flow 2015.

[PCK plots, thresholds 1-10 px and 10-100 px] Methods compared: hypercolumn-fusion, topdown-fusion, and HiLM (conv2+conv3/conv4/conv5) trained on KITTI and on Sintel. Generalization results when training on MPI Sintel and evaluating on KITTI Flow 2015.

†Code and models will be made available at http://www.nec-labs.com/~mas/HiLM/.

Optical Flow Results

Method          Fl-bg    Fl-fg    Fl-all
FlowNet2        10.75%    8.75%   10.41%
SDF              8.61%   26.69%   11.62%
SOF             14.63%   27.73%   16.81%
CNN-HPM         18.33%   24.96%   19.44%
HiLM (Ours)     23.73%   21.79%   23.41%
SPM-BP          24.06%   24.97%   24.21%
FullFlow        23.09%   30.11%   24.26%
AutoScaler [4]  21.85%   31.62%   25.64%
EpicFlow        25.81%   33.56%   27.10%
DeepFlow2       27.96%   35.28%   29.18%
PatchCollider   30.60%   33.09%   31.01%

Quantitative results on KITTI Flow 2015. AutoScaler [4] is an image-pyramid/multi-scale 2D correspondence estimation method. Bold represents the best result, underlined the second best.

Qualitative results on KITTI Flow 2015. 1st row: input images. 2nd row: DeepFlow2. 3rd row: EpicFlow. 4th row: SPM-BP. 5th row: HiLM (Ours). Red depicts high error, blue represents low error.

3D CNN Architecture for 3D Correspondence Estimation

[3D architecture diagram] Siamese 3D towers over a 30^3 TDF voxel grid of a local patch: Conv-1: 64 3^3/S1 → Pool-1: 64 2^3/S2 → Conv-2: 64 3^3/S1 → Conv-3: 128 3^3/S1 → Conv-4: 128 3^3/S1 → Conv-5: 256 3^3/S1 → Conv-6: 256 3^3/S1 → Conv-7: 512 3^3/S1 → Conv-8: 512 3^3/S1, with an auxiliary shallow branch Conv-2a: 128 3^3/S2 → Conv-2b: 256 3^3/S2. Feature-extraction heads FE: 512 5^3/S1 and FE: 512 1^3/S1 each feed a CCL loss during training. At test time, coarse matching is followed by constrained matching to produce precise matches.
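The TDF input follows 3DMatch [5]: each voxel of the 30^3 grid stores a truncated distance to the nearest surface point, flipped so that voxels on the surface are close to 1 and empty space decays to 0. The sketch below is an assumed illustration of that preprocessing (voxel size and truncation are hypothetical), not the authors' code:

```python
import numpy as np
from scipy.spatial import cKDTree

def tdf_voxel_grid(points, center, voxel_size=0.01, dim=30, trunc_voxels=5):
    """Truncated Distance Function grid for a local patch around `center`.

    points:       [M, 3] point cloud of the scene (meters).
    center:       [3] center of the local patch (a keypoint).
    voxel_size:   voxel edge length (assumed value).
    dim:          grid resolution (30^3 as in the architecture above).
    trunc_voxels: truncation margin, in voxels.
    """
    # Voxel-center coordinates of the dim^3 grid centered on the patch.
    offsets = (np.arange(dim) - dim / 2.0 + 0.5) * voxel_size
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + center

    # Distance from each voxel center to the nearest point, truncated and
    # flipped: 1 at the surface, decaying to 0 in free space.
    dist, _ = cKDTree(points).query(centers)
    tdf = 1.0 - np.minimum(dist / (trunc_voxels * voxel_size), 1.0)
    return tdf.reshape(dim, dim, dim)
```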

For a fair comparison with 3DMatch [5], a CNN-based approach to 3D correspondence estimation, we disable hard negative mining in our 3D CNN architecture.

3D Correspondence Results

[PCK plots, thresholds 1-10 cm and 10-25 cm] Methods compared: 3DMatch, HiL (conv2), HiL (conv8), and HiLM (conv2+conv8). Comparison of different CNN-based methods for 3D correspondence estimation.

References
[1] Choy et al.: Universal Correspondence Network. In: NIPS (2016)
[2] Hariharan et al.: Hypercolumns for Object Segmentation and Fine-Grained Localization. In: CVPR (2015)
[3] Pinheiro et al.: Learning to Refine Object Segments. In: ECCV (2016)
[4] Wang et al.: AutoScaler: Scale-Attention Networks for Visual Correspondence. In: BMVC (2017)
[5] Zeng et al.: 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. In: CVPR (2017)